DOCUMENT RESUME

ED 224 823                                              TM 830 005

AUTHOR        Choppin, Bruce; And Others
TITLE         A Critical Comparison of Psychometric Models for
              Measuring Achievement. Methodology Project.
INSTITUTION   California Univ., Los Angeles. Center for the Study
              of Evaluation.
SPONS AGENCY  National Inst. of Education (ED), Washington, DC.
PUB DATE      Nov 82
GRANT         NIE-G-80-0112
NOTE          279p.
PUB TYPE      Reports - Research/Technical (143)
EDRS PRICE    MF01/PC12 Plus Postage.
DESCRIPTORS   *Academic Achievement; Achievement Tests; Comparative
              Analysis; Data Analysis; Item Analysis; *Latent Trait
              Theory; *Mathematical Models; Psychometrics;
              *Testing; Testing Problems; Test Theory
IDENTIFIERS   Generalizability Theory; Rasch Model; Three Parameter
              Model



ABSTRACT

A detailed description of five latent structure models of
achievement measurement is presented. The first project paper, by
David L. McArthur, analyzes the history of mental testing to show how
conventional item analysis procedures were developed, and how
dissatisfaction with them has led to fragmentation. The range of
distinct conceptual and methodological approaches to achievement
testing that now exists is discussed. The second paper, by Kenneth A.
Sirotnik, analyzes measurement in achievement as a central and
continuing problem in mental testing, highlighting the differences
between the modern alternatives. Five papers by David L. McArthur,
Bruce Choppin, Ronald K. Hambleton, Rand R. Wilcox and Noreen Webb
individually treat Student-Problem (S-P) chart analysis, the Rasch
model in item analysis, a three-parameter logistic model, latent
class models, and generalizability theory. An analysis of reading
comprehension data by four of the contributors and Raymond Moy is
presented. J. Ward Keesling presents a summary paper on the empirical
work carried out so far in testing different models on common sets of
data. (Author/PN)

INTRODUCTION AND OVERVIEW

Bruce Choppin 
Center for the Study of Evaluation, UCLA 

The research project reported here developed out of a growing
concern at the fragmentation that is occurring within the psychometric
field. Dissatisfaction with the limitations inherent in traditional
forms of mental test analysis (as typified by the norm-referenced
multiple-choice test of achievement) has led in recent years to a
variety of new psychometric theories and procedures. The traditional
approach to testing was developed in order to provide ranking of
students and/or to select relatively small proportions of students for
special treatment. In these tasks it was fairly effective, but it is
increasingly seen as inadequate for the broader spectrum of questions
that educational measurement is now called upon to address. Novel
applications have stimulated new psychometric models and methods, each
shaped to deal with the specific problems of the particular
situation. The last two decades have seen the development of new
types of tests, new scoring methods, new procedures for item analysis,
and entirely new conceptions of the mental measurement process.

A marked characteristic of the professional literature on these
novel approaches to measurement is its parochialism. Many of the most
prolific psychometricians display little interest in models other than
their own, and there have been few, and mostly inadequate, attempts to
integrate theories and results. The proponents of different models
have different objectives; implicitly or explicitly they make
differing assumptions; and they frequently use the same words and
phrases to mean different things (e.g., reliability, accuracy,
guessing, error and true-score). Separate methodologies based on
different models have diverged to a point where it is no longer
possible to identify a mainstream approach to educational measurement,
and where informed and balanced advice on the full range of
alternative approaches is almost impossible to obtain.

The present project was designed to take advantage of the wide
range of interest and experience of different approaches to
measurement jointly held by the professional researchers who
constitute the "methodology group" at CSE. The project had two
related goals. The first was to document in some detail the
philosophy, assumptions, mathematical procedures, advantages,
limitations, etc. of each of five different approaches to the
measurement of achievement that currently command considerable
psychometric interest. We have tended to describe these five
approaches as alternative models of achievement measurement, and in
the strict scientific sense this is true, though a comprehensive
mathematical formulation is easier for some than for others. This
detailed documentation would enable us to clarify our understanding of
the similarities and differences among the models so that we might
explore with real data the consequences of adopting one analytic
strategy rather than another.



The second purpose, arising from the first, was to develop a much
needed "user's guide" that would set out, fairly and comprehensively,
the rationale underlying each of the separate approaches and provide
sound advice to the potential user regarding the selection of an
approach and how these models may be operationalized.

The models we consider all belong to the class of latent
structure models in that their analysis is directed to the inferential
classification of test items and/or persons, based on theoretical
assumptions concerning the structure of test data and conceptual
theories of measurement. Within this framework, the different models
may be seen as attempts at the solution of a variety of measurement
problems. Sometimes, even when the models or procedures appear
similar, the issues of central concern to one may not be of any
particular interest to the other. In the measurement area, we meet
variations in philosophy and value systems as well as in statistical
referents.

A good example of this can be found in the recent controversy
over latent trait models. Although the Rasch one-parameter model and
the three-parameter model developed by Birnbaum and Lord appear to
have a lot in common (the Rasch model is mathematically a special case
of Lord's model), they are conceptually quite distinct. Lord began
some thirty years ago with large quantities of item response data
which he wished to understand and explain. For him it was important
to find a model that fitted his data and could make sense of it.
Today his disciples view the Rasch model as a model that does not fit
their data well. It is founded on assumptions (e.g., no guessing)
which are often not met in practice. This group of measurement
specialists rightly discard the (inexpensive) Rasch model in favor of
a more complex analysis that better meets their need to "fit" data.
On the other hand, Rasch was developing his model (during the 1950's),
not on the basis of actual test data, but rather on a series of
principles and axioms for measurement systems that he extracted from
other realms of scientific experience. He did not create his model
primarily to explain existing data sets, but rather to form the basis
for constructing new measurement systems. For his followers, test
items must "fit" the model if they are to be useful for measurement.
The goal is to find items that do fit the model so as to permit the
construction of test instruments with the optimal properties that
Rasch described.
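
The remark that the Rasch model is mathematically a special case of
Lord's model can be illustrated with a minimal sketch. It assumes the
usual logistic forms of the two models, which this chapter does not
write out: a three-parameter item response function with
discrimination a, difficulty b and lower asymptote ("guessing") c, and
a Rasch function with difficulty b only. The function and parameter
names are illustrative, and the scaling constant that some
presentations attach to the discrimination parameter is omitted.

```python
import math


def three_pl(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under an assumed
    three-parameter logistic form.

    theta: person ability; b: item difficulty;
    a: item discrimination; c: lower asymptote ("guessing").
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def rasch(theta, b):
    """Probability of a correct response under the Rasch
    (one item parameter) logistic form."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


if __name__ == "__main__":
    # With a = 1 and c = 0 the two functions agree at every ability level.
    for theta in (-2.0, -0.5, 0.0, 1.0, 3.0):
        assert abs(three_pl(theta, b=0.3) - rasch(theta, b=0.3)) < 1e-12

    # With c > 0, a very low-ability person still succeeds with
    # probability close to c, which the Rasch form cannot represent.
    print(three_pl(-10.0, b=0.0, a=1.2, c=0.2))
    print(rasch(-10.0, b=0.0))
```

Fixing a = 1 and c = 0 is what the "no guessing" assumption mentioned
above amounts to in this sketch; the conceptual contrast between the
two camps is not captured by the algebra.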

Unfortunately, many psychometricians in each camp have not been
able to appreciate the distinction between these two approaches.
There have been public debates during which Item Response Theorists
have condemned the Rasch model for not "fitting" real data, while the
Rasch practitioners attack Item Response Theory for dealing with
models whose parameters cannot be satisfactorily estimated and which
do not satisfy the requirements for "objective measurement". The
criticisms are sound in themselves, but they do not relate to the
issues that the other side holds to be important.

There are other, though perhaps less dramatic, examples of where
different priorities and different concerns have led to some breakdown
in communication. For example, Generalizability Theory is directly
concerned with measures, and with analyzing the "errors" associated
with them. However, it treats these on a grouped basis as "error
variance" and makes certain assumptions about their distribution. By
contrast, latent trait theorists use "standard error of measurement"
on an individual basis, finding it to be a more useful concept than
the conventional one of test reliability. Latent trait theorists also
make assumptions about the distribution of these errors, and in
general these assumptions are not compatible with those of G-theory.
Both approaches are useful for solving specific measurement problems,
but their areas of application are very different. The extent to
which the two approaches may be regarded as complementary, and may
indeed support one another, is not well understood.

Our work has addressed these and other questions. We have
brought some illumination to previously dark and shadowy areas where
two or more of the models come together.

However, we do not feel that we have yet reached our second
objective of developing a comprehensive and useful guide to practice.
More empirical work in comparing the effects of the different models
needs to be done, and the handbook we wish to develop will contain
more demonstrations using real data than are found in this report.
There has not been time in the last twelve months to carry out as much
of this work as we would have liked, but we feel that we are on the
right track and that our work is sufficiently important for its
completion to be given some priority.

The format of the present report is described below. There are
two introductory chapters. The first analyzes the history of mental
testing to show how conventional item analysis procedures were
developed (in response to which pressures and constraints), and how
dissatisfaction with them has led to fragmentation and the range of
distinct conceptual and methodological approaches to achievement
testing that now exist. The second paper analyzes in depth a central
and continuing problem in mental testing, one which not merely
illustrates the shortcomings of the traditional approach, but
highlights the differences between the modern alternatives.

There follow five papers treating each of the selected approaches 
individually but according to a standard format. 

These "models" are: the S-P Chart Analysis developed by Sato,
which may be viewed as a simplified form of Guttman scaling; two
latent trait logistic models (Rasch with one item parameter and Lord
with three item parameters), given separate treatment because of the
philosophical and conceptual contrast cited above; a latent class
model to which the estimation of true scores is central; and
Generalizability Theory which, though somewhat different in scope from
those mentioned earlier, offers a different mathematical model for
test data, and some powerful statistical procedures for interpreting
them.

Following this we present a summary of the empirical work carried 
out so far in testing out different models on common sets of data. 

In conclusion, a chapter (available only in outline at present)
summarizes and synthesizes the earlier parts of the report and draws
some definitive conclusions regarding the applicability of the various
models to different measurement problems.

EDUCATIONAL TESTING AND MEASUREMENT: A BRIEF HISTORY

David McArthur 
Center for the Study of Evaluation, UCLA 

Educational assessment in the Western tradition has a long but
very irregular history. Seven centuries ago, one English college was
deemed remiss in its responsibilities because its founder had
determined that its recent graduates "...expressed themselves very
inaccurately in the learned languages..." (Sylvester, 1970, p. 19);
the method of such determination was not described. A tradition of
oral examinations was built up over several centuries, only to
disintegrate almost completely by the time Isaac Newton attended
college about 1660; not only were there no examinations but frequently
the lecturers themselves simply never showed up for classes. However,
in another hundred years, both Oxford and Cambridge, recognizing the
deteriorated situation, decided to improve their curriculum and
instituted regular written examinations in a variety of topics. The
exams of this era were almost exclusively essay questions emphasizing
factual recall; one extant example shows eight questions each in
history and geography, and six in grammar, primarily Latin and Greek.
In the education of the younger pupils, examinations began to become
more prevalent as textbooks for the grammar school came to be
formulated into distinct grade levels.

The new sequences of textbooks allowed a more precise grading to
be implemented in schools in various parts of Europe... Within the
school a further step was the development and application of the
principle of a child's regular progression through grades at
various intervals of about a year (Bower, 1975, p. 419).

The Jesuits, finding that such a procedure fit perfectly into their
concept of ratio (the systematically ordered body of knowledge), took
up the idea with vigor, and it rapidly spread across Europe.

Meanwhile, in China, civil service examinations were already
several millennia old. The earliest proficiency testing dates from
2200 B.C., and formal procedures for examination date from 1115 B.C.
Despite a concentration on literary rather than managerial skills, the
system was to be the model for a number of efforts at standardizing
competition for civil service positions in Europe and the U.S. during
the 19th century. But in China the testing system was abolished in
reforms at the beginning of the 20th century, as Western technologies
and educational orientations intruded into the Orient (DuBois, 1964,
1967).

In the United States, it was not until 1845, following Horace
Mann's advocacy of written examinations, that testing was incorporated
into educational practice. The first recorded examination was
administered in Boston that year, and the concept took hold quickly
(Englehart, 1950). Within thirty-five years, promotion from grade to
grade was no longer made by personal recommendation but instead
invariably was judged by success or failure, scored as a percentage,
on a written exam. Mann's viewpoint of testing, while not using the
word "objective," carried with it a decided bias towards objective
measurement and standard tests (Ruch, 1929). The earliest objective
educational tests are found in a book complete with questions, answers
and scales, by an English schoolmaster, dated 1864 (Kelley, 1927).
Objective tests in spelling and arithmetic were in place in the U.S.
by the 1870's. Then, in 1881, the superintendent of schools in
Chicago, expressing a strong sentiment against testing in particular
(if not against science in general), decreed that advancement of
students was to be carried out only by direct recommendations of
teachers and principals. Testing for purposes of grade-level
advancement was prohibited. His viewpoint was widely shared;
suddenly, the impetus for "objective" measurement and assessment was
on the wane. "Examinations for grade promotions were gradually
abolished in all the best schools," claimed the superintendent's
successor. "The person best qualified to judge of a child's ability
to go on is his teacher... To say that any other test is necessary is a
travesty on common sense" (Bright, 1895, pp. 274-275). By the end of
the nineteenth century, educational testing had achieved a bad name:
teachers were "teaching on the test," devoting weeks of preparation
and drill to extant editions of upcoming exams, and the public was not
pleased.

A completely separate thread in the fabric of educational
measurement is found in a review of the history of statistics. The
first lectures in statistics date from around 1560; the first use of
the word "statistic" is placed at 1749, in reference to the accounting
of all the things that make up a kingdom (Meitzen, 1891). While
extensive developments in mathematics were being made during this time
(Newton, for example, was solving problems in differential calculus by
1676), the setting out of facts and figures in the social sciences for
many years was limited to tabulations of various facts, actuarial
tables, and census taking, the first about 1769 in Denmark.
Interestingly, some recognition of the importance of understanding
individual differences in mental abilities is found in the field of
astronomy by 1822 (Freeman, 1926). It was not until this century that
the word "statistics" came to refer exclusively to quantitative
approaches; its origins apparently are tied to the Germanic discipline
called "Staatenkunde" or study of governments and politics. The
profession suffered a decline as the old teachers passed away, and the
task of statistics was made increasingly narrow.

In 1806 and 1807 a passionate controversy arose against the
brainless bungling of the number statisticians, the slaves of the
tables, the skeleton-makers of statistics... The opponents in the
sharp attack were themselves, however, not sufficiently clear how
new and precise limits for their science should be determined
(Meitzen, 1891, pp. 49-50).

An International Statistical Congress was formed to attempt to resolve
the confusion; it met first in 1853 and showed a surprising degree of
success. Even though its members chose to stay out of issues of
statistical theory, in 1869 one of their resolutions declared:

...that in all statistical researches it is important to know the
number of observations...; the qualitative value is to be
measured by the divergences of the numbers among themselves as
well as the average...; it is desirable to calculate... the
average deviations (Meitzen, 1891, p. 80).

These principles formed the basis for technical developments in
educational statistics into the twentieth century: one of the first
texts (Rugg, 1917) devoted most of its efforts to tabulation,
averages, frequencies and variabilities. Despite several pioneering
scales in educational attainment, in large measure the collection and
analysis of data at this time was confined to tabulations of school
attendance and costs. The statistical societies of the day were deeply
embroiled in social problems, e.g., the relations of education to
crime, and spent no time at all on assessing educational achievement
beyond such indices as the ability to sign one's own name (Cullen,
1975).

By the middle of the nineteenth century, considerable progress
had been made in the analysis of experimental data from agricultural
research. Good experimental designs, including factorial and
split-plot techniques, were in place about 1850. Galton spent time
investigating how mathematical solutions might best be developed for
data from studies of Charles Darwin, building a number of statistical
tools in the process, and was the first to attempt measuring
characteristics of individual intelligence (1853). But it was not
until Pearson's chi-square test (1900) and Student's t-test (1908)
that appropriate quantification of educational data could be
developed, although the latter, surprisingly, took a number of years
to catch on (Cochran, 1976). Fisher's analysis of variance (1924)
drew heavily on these precursors but it too was relatively slow in
being incorporated into the repertoire of educational statisticians.
Guilford's text on fundamental statistics in 1942 awards analysis of
variance fewer than nine pages, embedded in a chapter on reliability.

In 1890 appeared the first study of reliability (Edgeworth,
1890). In the same year, the seminal short article by Cattell (1890)
marked the first time the words "mental tests" were used together.
Following Galton's lead, several investigators in Germany began to
develop mental tests, and in the U.S. there was extensive interest in
the relationship of mental capacities to physical characteristics.
The American Psychological Association set up a standing committee in
1895 to consider cooperative efforts in mental and physical
statistics; the American Association for the Advancement of Science
did likewise the following year. Binet, who had been working on
problems in mental reasoning since 1886, wrote an important article in
1898 on the utility of measurement and scaling in the appraisal of
human intelligence. However, two major studies of testing around this
time (Sharp, 1899; Wissler, 1901) concluded that many of the available
tests used for psychological research fell far short of their claims,
in both content and method (Peterson, 1925). In education, Rice's
(1897) study of spelling attainment, using a single list of 50 words
in a test administered to 30,000 children, was a pioneering study,
which circulated widely but gained few supporters (Wilds & Lottich,
1970).

About the turn of the century there was a fair degree of public
discouragement about educational testing. However, about this time,
the first survey of school facilities and educational practice was
conducted, the College Entrance Examination Board was established, and
in 1902 the first course in educational measurement was taught (by
Thorndike at Columbia) (Meyer, 1965). Concurrently, interest in the
concept of general intelligence was being pursued by a number of
investigators, following a suggestion by Galton in 1883 and a study of
1,500 children conducted in 1891 (Burt, 1909). In the analysis of
results from the latter investigation, however, came the explicit
realization that statistical methods for educational measurement were
in desperate need of thoughtful improvement. Burt speculated that the
consistent failures of research investigations in the area of general
intelligence before the turn of the century

were largely due to their reliance for discovery of correlations
upon mere inspections of the data they obtained, instead of upon
quantitative determination and mathematical deduction (pp. 94-95).

During the first decade of the twentieth century, the growing impetus
for increased statistical rigor could be felt in several areas;
measurement successes in anthropometry and biology provided much
needed support for such improvement. In 1904, Toulouse and Pieron's
two-volume manual on laboratory experiments included sections on
intelligence and the measurement of individual differences. In 1906
the American Psychological Association created a permanent committee
charged with evaluating requirements for standard laboratory technique
and appraising both group and individual tests with attention to
practical applications. Binet's test for intelligence (1905) and
Thorndike's book on mental measurement (1904) had particular
significance during this time, as did Spearman's (1904) paper on
general intelligence. By 1910, a vast number of tests in skills like
English, spelling, handwriting, reading and arithmetic had emerged,
followed closely by more technical articles on topics like numerical
analysis, standardization, validity and correlations.

...American educators quickly realized that the scale idea could
be applied not only to intelligence but to achievement as well.
There followed a phenomenally creative period during which
testmakers developed instruments for virtually every aspect of
educational practice (Cremin, 1961, p. 186).

In 1913, the National Council of Education released a major
report on standards and tests for measuring school efficiency, and
expressed this sentiment:

We are only beginning to have measurement undertaken in terms of
standards or units which are, or may become, commonly
recognized. Such standards will undoubtedly be developed by
means of applying scientifically derived scales of measurement to
many systems of schools. From such measurements it will be
possible to describe accurately the accomplishment of children
and to derive a series of standards... (Strayer, 1913, p. 4).

Graves, reviewing the condition of education in 1913, expressed the
sentiment that the application of mathematics to measurements in
education was one of the most significant movements of that time.

Developments in objective measurement of intelligence and
educational achievement came to a head with the crisis of the Great
War. Work in Germany on the screening of inductees had been in
progress since 1905; Binet and Simon (1910) dismissed the application
of intelligence testing in the French army (Peterson, 1925). In the
U.S., Terman's revision of the Binet scale was completed by 1917, and
was applied soon thereafter to the testing of 1.7 million recruits. A
small team of educational psychologists produced the Army Alpha and
Beta tests of intelligence between May 28 and June 10, 1917; a copy of
the examiner's manual was en route to the printer within a month.
Immediately after the war, as the Army was selling thousands of unused
test blanks, both educational specialists and the public began to
realize that objective test results had to be taken with some degree
of caution. One of the originators of the Army Alpha expressed the
sentiment unambiguously: "We do not know what intelligence is and it
is doubtful if we will ever know what knowledge is" (Goddard, 1922,
quoted in Spring, 1972, p. 5). Even so, by 1920, objective testing
formed the core of educational assessment methods. The Journal of
Educational Measurement devoted several issues in 1921 to a symposium
on scientific measurement of intelligence.

During the decade that followed, the objective assessment of
intelligence "swept America, and to a lesser extent Canada, like an
educational crusade... The critics were numerous but few in comparison
to the advocates..." (Marks, 1976, p. 10). McCall's (1922) book on
educational measurement and Monroe's (1923) the following year were
the first to set out the procedures for a "new type examination," the
multiple-choice and true-false tests. Principles of test construction
began to earn chapters of their own, and the variety of
interpretations and uses of tests was becoming a major consideration
for many educators (Monroe, 1945). Then came the first contributions
to what is now recognized as classical test theory: Thurstone's
(1925, 1926, 1927) articles on the scoring of individual performance,
Ruch and DeGraff's (1926) study of corrections for guessing, Ruch's
(1929) The Objective or New Type Examination, and Thurstone's (1931)
The Reliability and Validity of Tests.

The concept of reliability is illustrative of the historical
development of educational measurement. Because of its basis in
correlational method, which was already well advanced at the turn of
the century, a number of technical articles appeared quite early
concerning the statistical nature of reliability indices. By the time
that a major study was launched in the late 1920's by the American
Historical Association's Commission on the Social Studies into the
nature of testing in social sciences education, reliability measures
were regarded as essential by technical specialists but generally
disregarded by practitioners. Under the counsel of Truman Kelley, a
large-scale investigation was conducted on the use of tests for
determining overall class and school performance, recognizing
individual skill levels and individual differences, and appraising
attitudes and personality traits. It also studied the utility of the
"new-type" tests. In the long run both the social science specialists
and the educational measurement technicians were disappointed in the
results of the study. The former were not pleased by the tendency of
short-answer and multiple-choice tests towards fragmentary
presentation of, and limitations to, simple facts in the curriculum
and the deletion of shades of meaning. The latter felt that lack of
objective terms, which they saw as essential for objective
measurement, obviated the study's conclusions. Kelley's feelings were
sufficiently strong that he wrote a 15-page appendix entitled "A
Divergent Opinion as to the Function of Tests and Testing" in which he
excoriated the opponents of testing with more than a dozen carefully
reasoned arguments regarding the appropriate scientific use of
educational tests, plus one or two direct strikes to the more
emotional nature of the argument:

The opponents (of testing) show no awareness of the tests of
reliability and validity of measuring instruments, either
judgments of teachers or of test scores. We believe that such
awareness is essential to any educator who is not content to work
in the dark (p. 489).

In the areas of reliability and validity, technical proofs were
available as early as 1910, with Spearman (1910) providing a rationale
behind error measurement and Brown (1910) giving a definition of true
score. But it was some time before either term was given serious
treatment in the standard texts. Taking a representative contribution
from each decade, we find a half-dozen index entries in Rugg's 1917
text, 18 entries between the two in Ruch's 1929 text, four chapters in
his 1942 book, and eight full chapters devoted to the two topics in
Gulliksen's 1950 text. However, by the 1930's there had accumulated a
variety of estimation procedures and a great deal of confusion of
terms (Adams, 1936; Barthelmess, 1931; Lincoln, 1932). An attempt to
resolve the issues was made in Thurstone's small book on the topic in
1931, another in Kuder and Richardson's (1937) key article in test
reliability, followed by Guttman's (1945) reformulation and Cronbach's
(1947) discussion of the several different kinds of reliability
coefficients. The American Psychological Association tried to resolve
the various discrepancies by committee in 1954. Tryon (1957) provided
an extensive historical review of the reliability concept and a
domain-sampling reformulation. "The extraordinarily massive
literature in this topic," wrote Cattell (1964), "...has never lacked
statistical finesse and mathematical virtuosity" (p. 1), but he, too,
felt a need to suggest substantial redefinitions for both reliability
and validity, which in turn were ignored four years later with
publication of a definitive mathematical analysis by Lord and Novick
(1968).

The first formulations of a "sample-free" approach to mental
measurement are found in Lawley's (1943) analysis of item selection.
Although the problem had been explored tangentially by Horst (1936)
and more recently by Ferguson (1942), Lawley's paper was among the
earliest to seek mathematically rigorous justifications for the
selection of maximally discriminating test items, and to examine in
some detail the concept of item characteristic curves. Tucker (1946)
provided further statistical support. Gulliksen (1950) summarized the
early work in true score theory, and Lord explored the application of
latent trait theory to test theory with his doctoral dissertation,
published as A Theory of Test Scores (1952). Interestingly, he felt
that the actual utility of large portions of the theory would be
limited in practice by the difficulty in obtaining sufficiently large
data sets, and did not publish about the problem again for another ten
years. At that point he presented an important development, the
beta-binomial model of the frequency distribution of true scores and
raw scores (Keats & Lord, 1962), and further refined the definition of
true scores in Lord & Novick (1968). Meanwhile, Birnbaum explored
certain statistical properties of normal and logistic characteristic
functions in 1957 and 1958, but few other papers on this topic
appeared until the 1960's.

The sentiment has been expressed more than once that the science
of educational testing has progressed fitfully. Despite a plethora of
statistical developments, "most of the major theoretical and technical
distinctions and most of the principle points of dispute were in
existence by 1925" (Thomson & Sharp, 1983). This includes such
diverse topics as item analysis, test bias, the nature vs. nurture
arguments regarding individual intelligence, and at least the
beginnings of factor structure explanations for educational
assessment.

TOWARDS MORE SENSIBLE ACHIEVEMENT
MEASUREMENT: A VIEW AND REVIEW

Kenneth A. Sirotnik
Center for the Study of Evaluation, UCLA

Introduction

Much of what will follow here is a repeat of an unfamiliar, or at
least unpopular, theme. The essence of this theme has been either
implicit or explicit in writings dating as far back as the early 1930's
and continuing up to the present. (See, for example, Walker, 1931;
Guttman, 1944; Loevinger, 1947, 1948, 1954; Rasch, 1960; Lumsden, 1961;
Bentler, 1971; and Wright and Stone, 1979.) Probably the most
entertaining and insightful review is a rarely quoted article by
Lumsden (1976). These authors all propose different techniques (or
variants of the same techniques) and analytic models for scaling the
items on the ordinary test of achievement. But they all have two basic
things in common: (1) they are critical of, and represent alternatives
to, classical test theory and (2) they operate from fundamentally the
same notion of what it means to measure. The essence of the common
theme is, bluntly, that classical (and classical-like) test theories
are not very useful when it comes to test construction and analysis.

Why has not the nearly exclusive practice of traditional test theory
methods abated during the last fifty years? Why does nearly every new issue
of journals like Psychometrika or Educational and Psychological Measurement
contain yet another theoretical exposition involving true and error score
theory or some esoteric reformulation of the same old reliability coefficient?
Were the above authors and others like them just on a flight of fancy,
proposing crazy ideas that happened to escape the eyes of critical reviewers?
No! They merely challenged what to date amounts to over 70 years' worth
of archives of scholarly work on test theory models bearing little resemblance
to how people ordinarily think about what it really means to measure. To
be sure, each challenge did not offer a completely viable alternative to
common practices. But it seems to be part of the human condition to hang on
tenaciously to the familiar, to the security of a large investment, at least
until the market crashes and/or the tide of opinion noticeably changes
through the power of advertisement.

Such has been the case recently with the increased use of latent trait
models, particularly the model proposed by Rasch (1960) and popularized in
the U.S. by Wright (1968, 1969 [with Panchapakesan], 1977, and 1979 [with
Stone]). The point of this report is not, however, to advertise any
particular measurement model. Rather, I wish to continue advertising the
self-evident notion that how one conceptualizes the act of measurement
should have a lot to do with how one analyzes the quality of the
measurement act during its development, implementation and revision phases.

I will restrict this discussion to the measurement of achievement
with items of the usual correct-incorrect (1-0) variety. (However, the
basic notions are generalizable to ordered response scales more typical in
the measurement of values, attitudes, beliefs, opinions, etc.) My point
of view regarding how the measurement act is ordinarily conceptualized is
neither original nor very creative. It rests simply on analogy with
measurement in the physical sciences where constructs are often experienced
with the senses. The measurement of length, in particular a person's
height, is the usual example and will serve well here. Certainly most
constructs we attempt to measure in the behavioral sciences are not
directly experienced and this, of course, constitutes the main source of
difficulty. But it does not follow, necessarily, that the generic notions
of measurement be any different. Nor does it follow that measurement
models be deterministic, i.e., be developed in ideal terms from which
deviations are unaccounted for. Probabilistic models are those wherein all
deviations from the model have an expected probability of occurrence. Both
deterministic and probabilistic models exist in both the physical and
behavioral sciences.

Implicit in this view of measurement is an assumption that the test
items are all measuring the same thing (construct, trait, etc.). Extant
psychometric literature is replete with confusion over what exactly is
meant by this assumption and the two commonly used terms, unidimensional
and homogeneous, referencing sometimes similar and sometimes dissimilar
empirical interpretations of this assumption. The confusion, not
surprisingly, reduces to different views of the measurement act. Viewed in
its original factor analytic sense, unidimensionality refers to one
interpretable common factor explaining the item correlation matrix. This
fits well with the notion of measurement as repeated single-item tests and
the concept of reliability as internal consistency. But internal consistency
is only a necessary and not a sufficient condition for a single common
factor in an item set; yet, many traditional test theorists (e.g., Gulliksen,
1950; Ghiselli, 1964; Magnusson, 1966; and Allen and Yen, 1979) and
practitioners have used both unidimensionality and homogeneity in reference
to the internal consistency of a set of items.



To confuse the issue further, Guttman's (1944) "unidimensionality"
and Loevinger's (1947) "homogeneity" both, in empirical consequence, refer
to the cumulative ordering or scaling of a set of items, a fundamentally
different notion of the use of items to measure a single construct. The
analogue of this notion for probabilistic models (e.g., latent class and
latent trait models) is the concept of local independence, taken by many
latent trait theorists (e.g., Lord & Novick, 1968; Hambleton & Cook, 1977;
and Lord, 1980) as the equivalent of the assumption of unidimensionality.
(But see the discussion of Traub and Wolfe, 1981, p. 387.)

From my point of view, I assume that there exist sufficiently
singular achievement constructs, represented by item sets, that are
psychologically interpretable and that are of potential instructional
use. A reasonably successful application of a measurement strategy is
necessary but not sufficient evidence for a reasonably successful effort
at measuring a singular construct. In other words, a singular construct
is assumed at the outset; a priori verification of the assumption is,
in essence, an exercise in content validity; necessary a posteriori evidence
lies, in essence, in the degree of success in developing the measurement
device; sufficient evidence, however, is accumulated only through further
construct validation studies.

In what follows, a common conceptual view of the act of measurement
will be presented and contrasted, in general, with the act as implied by
traditional test theories. This discussion will then be punctuated by a
more specific overview of several traditional test theories to illustrate
the issue further. Finally, alternative models will be reviewed which
are more in line with how the measurement act is ordinarily conceived.

Precision and Accuracy: Disentangling the Concepts

Measurement and Dependability



It is important, first, to define measurement more explicitly. Many
definitions have been proposed, resulting in disputes over what does and
does not constitute measurement. My interest is not to debate the
issue at a philosophical level, but rather simply to clarify how the
term will be used here. It will serve my purposes well to follow
the lead of Torgerson (1958), who reserves the use of the term
measurement as follows:

The logic of measurement deals with the conditions necessary
for the construction of a scale or measuring device. Measurement
as used here refers to the process by which the yardstick
is developed, and not to its use once it has been established,
in, say, determining the length of a desk. It is essential
that we keep this distinction in mind. The use of the
established yardstick in "making a measurement" is a rather simple
procedure involving merely the comparison of the quantity to
be measured with standard series, or perhaps only reading the
pointer or counter of an instrument designed for the purpose.
We are here concerned with the more basic problem of
establishing a suitable scale of measurement.

measurement pertains to properties of objects, and not to
the objects themselves. Thus, a stick is not measurable in our
use of the term although its length, weight, diameter, and
hardness might well be.

Measurement of a property then involves the assignment of numbers
to systems to represent that property. In order to represent
the property, an isomorphism, i.e., a one-to-one relationship,
must obtain between certain characteristics of the number system
involved and the relations between various quantities (instances)
of the property to be measured.

The essence of the procedure is the assignment of numbers in
such a way as to reflect this one-to-one correspondence between
these characteristics of the numbers and the corresponding
relations between the quantities. (pp. 14-15)

Implicit in this usage is the preference not to use the term
measurement in the broader sense of Stevens' classic definition:
"Measurement is the assignment of numerals to objects or events according
to rules." (Stevens, 1951, p. 22.) Nominal scales, therefore, are not the
result of measurement but of classification. Measurement presupposes,
therefore, that the object has a property that exists in magnitudes that
can be represented on either ordinal, interval or ratio scales. And again I
align myself with Torgerson, who finds it uninteresting to worry about
what is or is not "permissible," in practice, with measurement scales
of these several types:

a major share of the results of the field of mental testing
and of the quantitative assessment of personality traits has
depended upon measurement by fiat. This is clear, for example,
when curves are fitted by the process of least squares or when
product-moment correlations, means, or standard deviations are
computed. All of these presuppose that distance has meaning.
Hence, either explicitly or implicitly, the experimenter is
measuring the attribute on an interval scale whose order and
distance characteristics have obtained meaning initially through
definition alone.

The discovery of stable relationships among variables so measured
can be as important as among variables measured in other ways.
Indeed, it really makes little difference whether [a] scale of
length, for example, had been obtained originally through
arbitrary definition, through a relation with other established
variables, or through a fundamental process. The concept is
a good one. It has entered into an immense number of simple
relations with other variables. And this is, after all, the
major criterion of the value of a concept. (p. 24)

The "act" of measurement, then, refers generally to both the logic
of measurement and the process of constructing a test, i.e., a rule or
set of procedures operationalizing the construct in a manner consistent
with the logic of measurement. What, then, is a test theory? I would
prefer that the phrase "test theory" denote the complete act of not only
constructing the measuring instrument, but also of assessing further the
validity of that instrument, including its dependability under specified
conditions of use. In other words, a theory of testing, to be complete,
must include a measurement model, a dependability model and a validity
theory. This last ingredient really includes (and goes beyond) the
measurement and dependability models and is what justifies the usage of the
term "theory." I know of no past or current "test theory" that deals
explicitly with all three aspects. Traditional test theories are theories of
dependability (some more restricted than others) with some validity theory.
The newer latent trait models are just that, models for measuring a
presumed construct. The focus of this paper is clearly on measurement,
but by way of contrasting the act of measurement with the dependability
of obtained measures.

Now suppose we had before us a small collection of the usual
multiple-choice (or true-false, completion, etc.) items of the type commonly
found on a test designed to measure a specific achievement outcome. On their
face, all such tests "look alike." However, depending upon the conceptual
model of measurement underlying the analytical process for selecting these
items, this innocent looking collection could be quite different in terms
of item composition and empirical characteristics. It is the contention
here that classical theory is conspicuously lacking in explicit regard
for the potential value of the individual item. By this I mean that there
is no explicit recognition of the measurement function served by items.
Classical true and error models characterize the consequence of applying
a measurement rule; they do not characterize the essence of the rule itself.

Let's consider the "essence of a measurement rule" by continuing the
analogy with measuring a person's height. In measuring height, a tape
measure and its properties operationalize the rule. Instead of "tape
measure," let's use the simpler term "ruler." Suppose we use a ruler (of
sufficient length) to measure people's heights. Traditional test
theories have a lot to say about what to do with the obtained measurement; they
have little to say, however, about how the ruler is constructed in order to
obtain the measure, i.e., how the ruler is calibrated and how a numerical
result eventually becomes associated with each person as a quantitative
indicant of the height of the person. In other words, rather than the
question of precision with which any given measurement is obtained,
traditional test theories take the measurements as given and pursue the
question of accuracy, i.e., how consistent the measurement rule is over
repeated applications.

Precision and accuracy are cornerstone concepts of any theory of
approximate numbers. They reflect fundamentally different ideas in the
measurement process. Yet they are used interchangeably in the behavioral
sciences as synonyms for reliability. Two examples out of many are the
following quotes:

The physical scientist generally has expressed the
accuracy of his observations in terms of the variation
of repeated observations of the same event. The
mean of the squared deviations of these observations
about the obtained mean is the "error variance." This
is a measure of precision or reliability.... We regard
reliability as the consistency of repeated measurements
of the same event by the same process....
(Cronbach, 1947, p. 1.)

Reliability of measurement, then, pertains to the
precision with which some trait is measured by means of
specified operations.... Such indices will be useful
for comparing different tests so we can ascertain
which gives us the most precise or stable scores,
and will permit us to ascertain whether the
reliability with which a test measures is sufficient for
our purposes.... Casting reliability in terms of the
coefficient of correlation between parallel tests
provides another way of describing the precision of
measurement. (Ghiselli, 1964, pp. 215-218.)

In the physical sciences, the concepts of precision and accuracy
are clearly distinguished although not always in the same way. In the
absence of empirical error, a measurement m precise to the nearest
unit u has an inherent absolute error equal to ± u/2. In this case,
accuracy becomes relative error due to imprecision, i.e., (u/2)/m. But
when empirical error exists (that is, error due to the measurer, the
measuree, and/or the measurement circumstances), accuracy (not precision)
is usually defined as in the first sentence of Cronbach's (1947) quote
above. The dictionary is of little help in sorting out any systematic
distinctions. For example, Webster's New World Dictionary (College
Edition) gives us this definition: "Precision, the quality of being precise;
exactness; accuracy." And in the same dictionary is this definition:
"Accuracy, the quality of being accurate or exact; precision."

At the risk of confusing the issues further, I will elect the versions
of these two concepts that serve to keep two fundamental properties of the
measurement act separable. Suppose in measuring the height of a person,
the ruler is marked off in feet; we can then measure anybody's height to
the nearest foot. This is a statement of precision. Included in this
notion of precision is the overall length of the ruler. If it is only 5 feet
long, the measurement of people over 5 feet tall would necessarily be much
less precise. Precision is intrinsic in the construction of the measuring
instrument; it can be increased by conceptualizing and adding more hash
marks to the ruler. Half feet can be added to the ruler enabling the
measurement of height to be precise to the nearest half foot. It is not really
necessary that the hash marks be at equal intervals, or that the added
hash marks be midpoints of each interval.
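
As a small numerical sketch of the error bounds defined earlier (the
function names and the 6-foot example are illustrative assumptions, not
part of the original discussion), the inherent absolute error u/2 and
the relative error (u/2)/m can be compared for a ruler marked in feet
and one marked in half feet:

```python
def absolute_error(u):
    """Inherent absolute error of a reading precise to the nearest unit u."""
    return u / 2


def relative_error(m, u):
    """Relative error (u/2)/m of a measurement m precise to the nearest unit u."""
    return absolute_error(u) / m


# Hypothetical example: a height of m = 6 feet read from two rulers.
for u in (1.0, 0.5):
    print(f"unit = {u} ft: absolute error = {absolute_error(u)} ft, "
          f"relative error = {relative_error(6.0, u):.4f}")
```

Halving the unit halves both quantities, which is all that "more hash
marks" buys in the absence of empirical error.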

Possibly a better conceptualization of precision is gained by defining it
as the number of measurement decisions an instrument can potentially make.
The ruler calibrated in half feet can potentially make twice the number of
relative height decisions as can the ruler calibrated in feet.

To facilitate the analogy with test items, the ruler can be
reconceptualized as a collection of straight sticks consisting of a 1-foot
stick, a 2-foot stick, a 3-foot stick, and so on. The more precise ruler is
reconceptualized as a set consisting of a 1-foot stick, a 1½-foot stick,
a 2-foot stick, a 2½-foot stick, etc. Measurement of height, then, is
the process of isolating two adjacent (ordinality being assumed) sticks
within which lies the height in question and judging which of these sticks
is closest, i.e., to within u/2 units where u is the unit of precision.
Alternatively, the measure of a person's height is the number of sticks
surpassed by the person's height (plus u/2). If the person is judged to
be shorter (by u/2 or more) than the stick, he/she is scored zero; if
taller, he/she is scored one. The person's height is then the total score
after being tested on the set of sticks. Figure 1 lays out the process
schematically. Whether sticks are ordered as calibration marks on a ruler
or unordered and used summatively, the result is the same: the person's
height is judged to be 3 feet to the nearest foot. That is, the person's
height is somewhere in the theoretical interval of 2½ to 3½ feet.
Precision is inherent in the way in which the measuring instrument is
calibrated and made operational.
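
The stick-by-stick scoring just described can be sketched in a few
lines. This is only an illustration under stated assumptions: the
function names are hypothetical, the sticks are taken to be u, 2u, 3u,
... up to a maximum length, and a height of 3.2 feet is chosen
arbitrarily.

```python
def stick_scores(height, sticks, u):
    """Score 1 for each stick the height surpasses (to within u/2), else 0.

    A person shorter than a stick by u/2 or more is scored zero on it.
    """
    return [1 if height > stick - u / 2 else 0 for stick in sticks]


def measured_height(height, u, max_length):
    """Total score on sticks of length u, 2u, ..., max_length, times the unit u."""
    n_sticks = int(max_length / u)
    sticks = [u * (i + 1) for i in range(n_sticks)]  # assumed stick lengths
    return sum(stick_scores(height, sticks, u)) * u


# Hypothetical person, 3.2 feet tall, measured with sticks up to 7 feet.
print(measured_height(3.2, 1.0, 7.0))  # ruler calibrated in feet
print(measured_height(3.2, 0.5, 7.0))  # ruler calibrated in half feet
print(measured_height(3.4, 0.5, 7.0))  # a slightly taller person, half feet
```

Because the total is a simple sum of 0/1 scores, the order in which the
sticks are presented does not affect the result.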

Accuracy is reserved here as a term for describing the degree to which
the use of the measuring instrument is error-free. Accuracy is an
empirical concept given an already calibrated instrument. Indexing the level
of accuracy involves repeated measurements under the circumstances in which
accuracy is required. In the above example, to the extent that we can
consistently arrive at (or close to) the same measurement of height (to the
nearest foot or half-foot depending upon which ruler we use), we have an
accurate measuring procedure. The more accurate the procedure, the less
variability in obtained measurements over repeated measurement trials.

Figure 1

Schematic Representation of the
Act of Measurement
(Height as an Example)

The complete independence of the concepts of precision and accuracy
should be clear: A highly precise instrument can be grossly inaccurate
(a rubber measuring stick calibrated to the 32nd of an inch) compared
to the accuracy of a less precise instrument (a steel measuring stick
calibrated in yards). Moreover, accuracy is a function not only of
instrument "decay," but also of the circumstances under which it is used.
Technically, therefore, we assess the accuracy of the measurement procedure,
which includes error due to the instrument itself, the person doing the
measurement, the person being measured, and the environment in which the
measurement process takes place.
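
The rubber-versus-steel contrast can be made concrete with a small
simulation. Everything numerical here is an assumption made for
illustration: the 70-inch person, the proportional "distortion" used as
a crude stand-in for stretch and circumstance, and the number of trials.
Accuracy is indexed, as in the text, by the variability of repeated
readings.

```python
import random
import statistics


def repeated_readings(true_length, unit, distortion_sd, trials, seed):
    """Simulate repeated readings from an instrument marked in `unit`s.

    Each trial distorts the apparent length by a random proportion, then
    rounds the result to the nearest calibration mark.
    """
    rng = random.Random(seed)
    readings = []
    for _ in range(trials):
        apparent = true_length * (1 + rng.gauss(0, distortion_sd))
        readings.append(round(apparent / unit) * unit)
    return readings


# Hypothetical values: lengths in inches, 200 trials per instrument.
rubber = repeated_readings(70.0, 1 / 32, 0.05, 200, seed=1)  # fine marks, unstable
steel = repeated_readings(70.0, 36.0, 0.001, 200, seed=1)    # coarse marks, stable

print("rubber stick, spread of readings:", statistics.pstdev(rubber))
print("steel stick,  spread of readings:", statistics.pstdev(steel))
```

In this sketch the steel stick returns the same reading on every trial
(2 yards), so it is accurate in the sense used here even though its
unit is far too coarse to be precise.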

Given this distinction, reliability (or, more generally, dependability),
as defined by classical (and classical-like) test theory models, is clearly
a synonym for the accuracy of a test. Empirically and theoretically, the
concepts of reliability and dependability have been concepts of repeated
measurements. In this sense, it matters little whether the repeated
measurements are replicates (strictly parallel) or samples from a domain
(randomly parallel); that is, the generic concept of accuracy remains intact
regardless of the conceptual changes in meaning of "true score" implied by
the several classical models. So long as we envision only the composite
result of the testing process, the classical models are quite analogous to
the physical model of measurement. The test score is analogous to the
"ruler score," i.e., the obtained height measurement. If we are interested
in assessing the accuracy of a single ruler, then we could use the original
classical test theory model of strictly parallel repeated measurements. If,
instead, we are more interested in the accuracy of a variety of rulers
(wood, steel, cloth, etc.) from different manufacturers, then the item
sampling models of randomly parallel repeated measurements would be useful.
The domain of generalizability changes, but the notion of accuracy does
not: empirical estimates are obtained through repeated measurements, either
with the same ruler (strict parallelism) or with a sample of rulers (random
parallelism).

However, the physical model and traditional test theory models part
company when it comes to the notion of internal consistency. Inquiry
into the internal consistency of a ruler would be directed at the
verification of the calibrations vis-a-vis the construct in question and the
selected measurement unit standard: an investigation of the precision of
measurement. In test theory, the inquiry is directed, as it should be,
toward the items. But in traditional theories, the inquiry proceeds by
simply recasting items into the same role as the test, viz., repeated
measurements: an investigation of the accuracy of measurement.

Where in the traditional test theory models is the concept of
precision? Conceptually speaking, the answer is, "Nowhere." Now of course
precision is manifested in the test item, in particular, the difficulty
of the test item. A student who can pass a more difficult test item
evidences more ability than does a student who can pass only a less
difficult item. The analogy with Figure 1 should be clear. The collection of
items is the ruler, conceptualized as an ordered bundle of sticks. The item
difficulties are analogous to the lengths of the sticks. Measuring the
ability of a student involves locating that pair of adjacent items B and A
such that the student correctly answers B (and all other items easier than
B) but not A (nor any other items more difficult than A). Traditionally, the
student's measure is the ordinal position of item B, or, equivalently, the
total number of items answered correctly by the student.

Certainly this analogy is lacking in some non-trivial respects. In
particular, the determinacy in the ordering of sticks is hardly (if ever)
realized in the ordering of items. If stick C is shorter than stick B,
and a student's height surpasses the length of stick B, then it will surely
pass that of stick C. Such is the beauty of measuring constructs we can
understand with our senses. But if item C is easier than item B, and a
student correctly answers item B, then it is not always a sure bet that
he/she will correctly answer item C as well. Such is the legacy of the
attempt to measure abstract behavioral constructs. Moreover, the
procedure for assigning an invariant metric to the measurement of height is
straightforward; it is much less so when using items to measure ability.
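
The difference between the determinate ordering of sticks and the
imperfect ordering of items can be illustrated with two response
vectors. The helper names and the two example patterns below are
hypothetical; items are assumed to be listed from easiest to hardest
and scored 1 (correct) or 0 (incorrect).

```python
def total_score(responses):
    """Number of items answered correctly (the traditional measure)."""
    return sum(responses)


def is_cumulative(responses):
    """True if every pass precedes every failure (items ordered easiest first)."""
    return list(responses) == sorted(responses, reverse=True)


def ordering_violations(responses):
    """Count item pairs where the easier item is failed but the harder one passed."""
    n = len(responses)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if responses[i] == 0 and responses[j] == 1)


stick_like = [1, 1, 1, 0, 0]   # passes everything up to "item B", fails the rest
item_like = [1, 0, 1, 1, 0]    # misses an easy item but passes harder ones

for pattern in (stick_like, item_like):
    print(pattern, total_score(pattern), is_cumulative(pattern),
          ordering_violations(pattern))
```

Both patterns yield the same total score of 3, yet only the first
behaves like a bundle of ordered sticks; the total score alone does not
reveal the difference.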

But I believe these to be minor details compared to the conceptual
identity between sticks and items and their role as calibrations
on the "ruler." The point to be made here is that this is not the role
cast for items by classical (or classical-like) test theories. Lest I
may have begun to lose some readers who are rusty on classical (and what
I am referring to as classical-like) test theory, I will turn to an
overview of several such theories with the expressed intent of further
illustrating the argument thus far presented. (Readers already familiar with
these models may skip to the Discussion in the next section with little
or no loss in continuity.)

Traditional Test Theories 

Some would probably argue (and justifiably so) that the sampling of
alternative approaches to follow should not be lumped into a single class
of test theories, especially one including classical test theory. I do
this here only because, in terms of their fundamental conceptualization
of the measurement process and important empirical consequences, they are
more similar to each other than to the models to be discussed next.

Classical Test Theory 

The basic postulate of classical test theory defines a belief
regarding the composition of the raw score obtained by a student, namely, that
this observed score is simply the student's true score plus what's left
over, commonly designated as the error score.

Using some fairly standard notation and the usual matrix layout of
the scores of n students on k items, we obtain the schematic in Figure 2.
Using T and E for true and error scores, the classical test theory model
posits for any student s that:

$$X_s = T_s + E_s \qquad (1)$$

A number of relationships obtain from this model when several additional
assumptions are made about the true and error score components of repeated
measurements on any student. Specifically, these assumptions are (a)
errors are totally random and cancel each other out; therefore, the mean
error is zero ($\bar{E} = 0$); (b) the correlation between true and error
score components is zero ($\rho_{TE} = 0$); and (c) the correlation between
errors over repeated measurements is zero ($\rho_{EE'} = 0$).

Assumption (b) leads directly to the variance composition of
the linear model above, viz., observed score variability is the sum of
variability in true and error scores:

$$\sigma_X^2 = \sigma_T^2 + \sigma_E^2 \qquad (2)$$

Figure 2

Student-by-item raw score matrix and notation. ($x_{si}$ = 1 or 0
if student s answers item i correctly or incorrectly.) Rows are
students 1, 2, 3, ..., s, ..., n; columns are items 1, 2, 3, ..., i, ..., k.
The column margin gives the item difficulties $p_i$; the row margin gives
the raw composite scores $X_s = \sum_{i=1}^{k} x_{si}$.


Assumption (c) leads further to the fundamental theorem that the covariance
between observed scores on any two repeated measurements is equal to that
between the true scores on these measurements:

$$\rho_{XX'} \sigma_X \sigma_{X'} = \rho_{TT'} \sigma_T \sigma_{T'} \qquad (3)$$

Finally, if a fourth assumption is added, namely (d) the repeated
measurements are parallel measurements, where parallel measurements are
defined as having equal true scores ($T = T'$) and equal error variances
($\sigma_E^2 = \sigma_{E'}^2$), then reliability (defined as the correlation
between parallel measures, $\rho_{XX'}$) is the equivalent of the ratio of
true score to observed score variance:

$$\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} \qquad (4)$$



But this is also the coefficient of determination in predicting observed
scores from true scores (or vice versa), i.e., the correlation between
parallel measurements is equivalent to the square of that between observed
and true score components:

$$\rho_{XX'} = \rho_{XT}^2 \qquad (5)$$

A little bit of algebraic manipulation of equations (2) and (4) gives us
an equation for the error variance in terms of reliability and observed
score variance. In standard deviation terms, this equation is

$$\sigma_E = \sigma_X \sqrt{1 - \rho_{XX'}} \qquad (6)$$



and is commonly referred to as the standard error of measurement. Noting
again the relationship in (5), this equation also represents the standard
error of estimate in predicting X from T:

$$\sigma_{X \cdot T} = \sigma_X \sqrt{1 - \rho_{XT}^2} \qquad (7)$$
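
The classical relationships above can be checked on a tiny constructed
data set. The scores below are hypothetical and were chosen by hand so
that the assumptions hold exactly: both error vectors average zero and
are uncorrelated with the true scores and with each other. The helper
functions are illustrative and use population (divide-by-n) moments.

```python
from math import sqrt
from statistics import mean


def pvar(x):
    """Population variance."""
    m = mean(x)
    return mean([(v - m) ** 2 for v in x])


def pcov(x, y):
    """Population covariance."""
    mx, my = mean(x), mean(y)
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)])


def pcorr(x, y):
    """Product-moment correlation."""
    return pcov(x, y) / sqrt(pvar(x) * pvar(y))


# Hypothetical true scores and errors for four students.
T = [12, 12, 8, 8]
E1 = [1, -1, 1, -1]
E2 = [1, -1, -1, 1]
X1 = [t + e for t, e in zip(T, E1)]  # first measurement:    X  = T + E
X2 = [t + e for t, e in zip(T, E2)]  # parallel measurement: X' = T + E'

reliability = pcorr(X1, X2)                   # correlation of parallel measures
sem = sqrt(pvar(X1)) * sqrt(1 - reliability)  # standard error of measurement

print("var(X), var(T) + var(E):", pvar(X1), pvar(T) + pvar(E1))
print("cov(X, X'), var(T):     ", pcov(X1, X2), pvar(T))
print("reliability, var(T)/var(X):", reliability, pvar(T) / pvar(X1))
print("squared corr(X, T):     ", pcorr(X1, T) ** 2)
print("SEM, sd(E):             ", sem, sqrt(pvar(E1)))
```

With real test data none of T, E, or the parallel form is observed
directly, which is why reliability must be estimated by one of the
procedures described next.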




So much for theory. In practice we have only what we observe: raw
scores X and the variance of these scores, which we use as an estimate
of $\sigma_X^2$. In view of the above theoretical relationships, if we can
also estimate $\rho_{XX'}$, then estimates for the remaining parameters can
be automatically computed. The estimate of reliability (denoted $r_{XX'}$)
is usually obtained in one or more of three fundamentally different ways
with attendant differences in empirical interpretation.

Reliability as Stability. This is the test-retest formulation of
reliability as the correlation between two administrations of the same test
over a specified interval of time. If the time interval is too long and 
allows for true individual changes in the construct being measured, then 
the test-retest correlation has little to do with reliability. But if the 
time interval is well-defined in relation to the expected consistency in 
individual true scores over that period of time, then the test-retest cor- 
relation estimates the stability form of test reliability. 

Reliability as Equivalence. This is the test-retest formulation of reliability as the correlation between two administrations of parallel tests at the same (or nearly so) point in time. This procedure most closely approximates the classical reliability definition but relies heavily upon the extent of true equivalence between the tests. (The same test could, of course, be used twice, but then practice effects might lead to an inflated test-retest correlation.) This procedure most closely approximates the empirical assessment of accuracy as discussed in the previous section.

Reliability as Internal Consistency. This is the test-retest paradigm taken to its logical conclusion. For example, split-half reliability is one form of internal consistency, equal to the correlation between two random halves of the test when adjusted upwards by the Spearman-Brown (Spearman, 1910 & Brown, 1910) equation to correspond to the full-length test. But then we could compute a "split-fourths" coefficient by averaging all possible correlations between four random quarters of the test and adjusting this average accordingly. Eventually, we get down to the item level, treating each item as a parallel replicate "test." The intraclass correlation (average inter-item correlation) stepped up by a factor of k (the number of items on the total test) by the Spearman-Brown formula turns out to be equivalent to the mean of all possible split-half coefficients (computed using the Rulon-Guttman formula [Rulon, 1939 & Guttman, 1945]) and was originally derived by Kuder and Richardson (1937) as their formula number 20:

$$\mathrm{KR20} = \frac{k}{k-1}\left[1 - \frac{\sum_i p_i(1-p_i)}{s_X^2}\right] \tag{8}$$


Since $p_i(1-p_i)$ is the variance ($s_i^2$) of a binary item, this formula is often written more generally as

$$\mathrm{KR20} = \frac{k}{k-1}\left[1 - \frac{\sum_i s_i^2}{s_X^2}\right] \tag{9}$$

Moreover, since the total variance can be decomposed into an additive sum of all item variances and twice the sum of all possible inter-item covariances, this formula can also be written as

$$\mathrm{KR20} = \frac{k}{k-1}\cdot\frac{\sum_{i \neq j} r_{ij}s_is_j}{s_X^2} = \frac{\text{average interitem covariance}}{\frac{1}{k}\left(\text{average item variance}\right) + \frac{k-1}{k}\left(\text{average interitem covariance}\right)} \tag{10}$$

From equation (10) it is evident that this estimate of reliability (a) approaches 1 as the number of items increases (so long as additional items are positively correlated with the total test score) and (b) is a measure of the extent to which items are intercorrelated with each other or, equivalently, with the total test score. Hence the use of the term "internal consistency." It becomes clear, then, that this is not only an index of reliability, but also an index (necessary but not sufficient) of the extent to which the set of items comprising the test is measuring the same construct (ability). In the sense of internal consistency, therefore, reliability has a direct bearing upon the construct validity of the test. As noted above, it is for this reason that many traditional test theorists and practitioners have used the terms "homogeneous" and "unidimensional" to refer to this property of a test.

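The equivalence of the item-variance and average-covariance forms of KR20 can be checked numerically. The following is a minimal sketch, assuming a students-by-items matrix of 0/1 scores and variances computed with divisor n; the function names and the six-student demonstration matrix are hypothetical and serve only as an illustration.

```python
import numpy as np


def kr20(scores):
    """KR20 in the item-variance form, for an n-by-k matrix of 0/1 scores."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]
    p = x.mean(axis=0)                   # proportion correct per item, p_i
    item_var = p * (1.0 - p)             # binary item variances, s_i^2
    total_var = x.sum(axis=1).var()      # total score variance (divisor n)
    return k / (k - 1) * (1.0 - item_var.sum() / total_var)


def kr20_from_covariances(scores):
    """The same coefficient in the average-covariance form."""
    x = np.asarray(scores, dtype=float)
    cov = np.cov(x, rowvar=False, bias=True)   # k-by-k item covariance matrix
    k = cov.shape[0]
    mean_var = np.trace(cov) / k
    mean_cov = (cov.sum() - np.trace(cov)) / (k * (k - 1))
    return mean_cov / (mean_var / k + (k - 1) / k * mean_cov)


# Hypothetical 6-student, 4-item score matrix (illustration only).
demo = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
kr20_demo = kr20(demo)
kr20_cov_demo = kr20_from_covariances(demo)
print(round(kr20_demo, 4), round(kr20_cov_demo, 4))
```

Both functions return the same value for any matrix with nonzero total score variance, since the second is only an algebraic rearrangement of the first.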
In a nutshell, these are the tenets and consequences of classical test theory. I have ignored a few other important consequences, primarily those having to do with the conceptualization of validity (effects of test length, correction for attenuation, and so forth). For purposes of comparison, however, the concepts so far developed are sufficient to illustrate what I believe to be profound differences between classical test theory and other, perhaps more realistic, measurement models.

Item Sampling Theory

One of the more difficult assumptions to accept (and empirically realize) is that requiring strictly parallel tests (or items). But with a slight shift in perspective, this assumption can be avoided. Consider again the layout in Figure 1. Suppose the k items are a random sample from a conceptually infinite population (universe, domain, pool, bank, etc.) of items over which a student's score would be meaningful. This score would theoretically be the student's true score. Likewise, the n students can be conceptualized as a random sample from an infinite population of students. And an item's true "score" (difficulty) is the theoretical average score on that item for the population of students.

In essence, what we have is the well-known random effects analysis of variance design, i.e., an n-by-k, students-by-items, random matrix sample from an infinite students-by-items matrix population. Once again, a linear, additive model is assumed; adopting the convention of using Greek letters for the population parameters, any student's (s) observed score on any item (i) is decomposed as follows:

$$X_{si} = \mu + \tau_s + \pi_i + e_{si} \tag{11}$$

where
$\mu$ = the overall mean (the general level of response relative to no response, i.e., zero);
$\tau_s$ = true score for student s;
$\pi_i$ = true score (difficulty) for item i;
$e_{si}$ = residual or error effect, which could also be regarded as the student-by-item interaction effect ($\tau\pi_{si}$) for a design with one random observation per cell.

With the addition of one more critical assumption, the statistical independence of student-item responses, the components of variance mean square expectations shown in Table 1 can be derived (Cornfield & Tukey, 1956).

Table 1

Components of Variance Mean Square Expectations
for the n x k Random ANOVA Model

Source      df            Mean Square   Expected Mean Square
Students    n - 1         $MS_s$        $\sigma_e^2 + k\sigma_\tau^2$
Items       k - 1         $MS_i$        $\sigma_e^2 + n\sigma_\pi^2$
Error       (n-1)(k-1)    $MS_e$        $\sigma_e^2$

Now, an internal consistency form of reliability can be derived without resorting to a definition based upon strict parallelism. Already, in accordance with the model, items can be characterized as randomly "parallel." We can proceed directly by defining reliability ($\rho_{XX'}$) as the proportion of total score variance ($\sigma_X^2$) that is the true score variance ($\sigma_\tau^2$). Since the model implies that

$$\sigma_X^2 = \sigma_\tau^2 + \frac{1}{k}\sigma_e^2 \tag{12}$$



reliability can be expressed as

$$\rho_{XX'} = \frac{\sigma_\tau^2}{\sigma_X^2} = \frac{\sigma_\tau^2}{\sigma_\tau^2 + \frac{1}{k}\sigma_e^2} \tag{13}$$

Using mean squares as estimates of their corresponding expected values, reliability can be estimated as

$$r_{XX'} = \frac{MS_s - MS_e}{MS_s} \tag{14}$$

which, with a bit of algebraic manipulation, can be shown to be identical to equations (8), (9) and (10) above. (This form of KR20 was first derived by Hoyt, 1941.) $\sqrt{MS_e}$, of course, is the corresponding estimated standard error of measurement equivalent to equation (7).
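The claim that the mean-square ratio reproduces KR20 can be verified on a small data set. The sketch below is a hedged illustration, assuming a complete students-by-items matrix of 0/1 scores with one observation per cell; the function names and the demonstration matrix are hypothetical.

```python
import numpy as np


def hoyt_reliability(scores):
    """Reliability from the two-way ANOVA mean squares, (MS_s - MS_e) / MS_s."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_students = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_items = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_error = ((x - grand) ** 2).sum() - ss_students - ss_items
    ms_students = ss_students / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_students - ms_error) / ms_students


def kr20(scores):
    """KR20 from item variances and total score variance (divisor n)."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]
    p = x.mean(axis=0)
    return k / (k - 1) * (1.0 - (p * (1.0 - p)).sum() / x.sum(axis=1).var())


# Hypothetical 6-student, 4-item score matrix (illustration only).
demo = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
hoyt_demo = hoyt_reliability(demo)
kr20_demo = kr20(demo)
print(round(hoyt_demo, 4), round(kr20_demo, 4))
```

The agreement is exact rather than approximate, because the students mean square is proportional to the total score variance and the error mean square absorbs the item variances.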

In terms of at least two important applied consequences (and there are more), then, both classical test and item sampling theories lead to the same result. Perhaps they are more similar than one might think. Indeed, with the exception of the strict versus randomly parallel test distinctions, both theories are formally equivalent. It can be shown that the Cornfield and Tukey (1956) assumptions of the random components model imply assumptions (a), (b) and (c) above for the classical test theory model, and vice versa. (See Lord and Novick, 1968, section 2.7.)

Nonetheless, the ANOVA framework implied by the item sampling model provides a convenient conceptual and analytic rubric that "liberates" (Cronbach, et al., 1963) the several classical reliability notions; that is, the sampling model emphasizes the multiplicity of possible reliability coefficients depending upon practical measurement consequences. Cronbach and his associates (Cronbach, et al., 1972) have formalized these concepts under the label "generalizability theory." In the simplest design, namely that represented in Figure 1, the "generalizability" coefficient is, of course, given by equation (14), designated previously by Cronbach (1951) as coefficient alpha ($\alpha$). But other more complicated designs are also relevant and are obtained by adding more factors (facets), and therefore more than one kind of true score parameter, each with its corresponding reliability coefficient, to the ANOVA design. Suppose, for example, n classes are observed k times by r raters on o occasions. We can now talk about (and compute) reliability coefficients not only for the main effects due to observations, raters and occasions, but for the possible interaction effects as well. Using generalized Spearman-Brown procedures, data from one study can then be used to estimate the k, r and o necessary to reach desired reliability levels in a future study. Moreover, some facets might be considered fixed and others random, and some populations finite and others infinite, all depending upon the practical applications intended.

However, notwithstanding the considerable conceptual and applied benefits accrued through liberating classical test theory of its strict assumption of parallel measurements, both theories conceive the fundamental dynamic of an achievement test identically: items play roles as replicate measurement rules rather than calibrations on a single measurement rule. Hence, they are first and foremost theories of accuracy, not of precision, as these concepts have been defined above.
Binomial Error Model 

An interesting twist on the item sampling model occurs if we restrict our attention to the single student s and conceptualize his/her responses to a random sample of k items as k independent binary events, each with the probability $\tau_s$ of a correct answer, where $\tau_s$ is the hypothetically true proportion correct score for student s in the population of items from whence the sample was drawn. This is the simple "loaded coin-flipping" model, i.e., a binomial model, where the probability for success (say, "heads") is p. Over repeated trials of n coin flips each, the standard deviation of the sampling distribution (i.e., the standard error) of the observed proportions of "heads" is well known to be $\sqrt{p(1-p)/n}$.

Translated to the notation and purpose here, the standard error (of measurement) for student s is the standard deviation of his/her sampling distribution of observed proportion correct scores ($\bar{X}_s$) on repeated random samples of k items as described in the paragraph above. This standard error (denoted $\sigma_{e_s}$) is given, therefore, as

$$\sigma_{e_s} = \sqrt{\frac{\tau_s(1-\tau_s)}{k}} \tag{15}$$

This standard error of measurement is estimated for each student by correcting (15) for sampling bias and substituting observed scores for true scores:

$$s_{e_s} = \sqrt{\frac{\bar{X}_s(1-\bar{X}_s)}{k-1}} \tag{16}$$

It should be clear from equation (15) that for item sampled tests of fixed length k, different standard errors of measurement obtain for different true scores. Students obtaining a score of 50 percent will have the largest estimated standard error, i.e., $.5/\sqrt{k-1}$; $s_{e_s}$ decreases symmetrically as scores either go up towards 100 percent or go down towards 0 percent.
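The dependence of the binomial standard error on the score level is easy to tabulate. The sketch below assumes the estimate takes the form sqrt(p(1 - p)/(k - 1)) on the proportion-correct scale, which matches the maximum of .5/sqrt(k - 1) stated above; the function name and the test length of 26 items are hypothetical.

```python
import math


def binomial_sem(proportion_correct: float, k: int) -> float:
    """Estimated binomial standard error for one student on a k-item test."""
    return math.sqrt(proportion_correct * (1.0 - proportion_correct) / (k - 1))


k = 26  # hypothetical test length, chosen so that sqrt(k - 1) = 5
sems = {p: binomial_sem(p, k) for p in (0.0, 0.1, 0.5, 0.9, 1.0)}
for p, value in sems.items():
    print(f"proportion correct {p:.1f}: estimated standard error {value:.3f}")
```

The estimate peaks at a score of 50 percent and falls to zero for students who answer every item correctly or every item incorrectly.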

This outcome, of course, is completely contrary to the assumption of independence of true and error scores in the classical test theory and item sampling models. In both of these models, the standard error of measurement (equation [7]) is a constant for all students regardless of their observed scores.

We can, however, derive a single standard error of measurement for the binomial model by simply computing the mean of the individual $s_{e_s}$. To do this requires generalizing the binomial error model for an individual's score to that for a distribution of scores. (See Lord and Novick, 1968, Chapter 23.) And in so doing, a couple of interesting results emerge. Assuming a linear relationship between true and observed scores, the usual formulation of reliability as the ratio of true score to observed score variance leads to the following estimate for internal consistency:

$$\mathrm{KR21} = \frac{k}{k-1}\left[1 - \frac{k\bar{p}(1-\bar{p})}{s_X^2}\right] \tag{17}$$

where $\bar{p}$ denotes the mean item difficulty.



This, of course, is Kuder and Richardson's formula 21, developed originally as an approximation to KR20. Clearly, it is a function only of the observed score mean (or mean item difficulty, since $k\bar{p} = \bar{X}$) and observed score variance. KR21 will always be less than KR20 unless there is no variation in item difficulties. When all items are of equal difficulty, they are, of course, equal to their average and formula (17) becomes identical to formula (8).
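The relationship between the two Kuder-Richardson formulas can be illustrated with two small score matrices, one with unequal and one with equal item difficulties. This is a sketch under the assumption that both coefficients use the total score variance computed with divisor n; the function names and both matrices are hypothetical.

```python
import numpy as np


def kr20(scores):
    """KR20 from the individual item difficulties."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]
    p = x.mean(axis=0)
    return k / (k - 1) * (1.0 - (p * (1.0 - p)).sum() / x.sum(axis=1).var())


def kr21(scores):
    """KR21 from the mean item difficulty only."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]
    totals = x.sum(axis=1)
    mean_p = totals.mean() / k           # mean item difficulty
    return k / (k - 1) * (1.0 - k * mean_p * (1.0 - mean_p) / totals.var())


# Hypothetical matrix with unequal item difficulties (5/6, 1/2, 1/2, 1/6).
unequal = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# Hypothetical matrix in which every item has difficulty .5.
equal = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(round(kr21(unequal), 4), round(kr20(unequal), 4))
print(round(kr21(equal), 4), round(kr20(equal), 4))
```

For the first matrix KR21 falls below KR20; for the second the two coefficients coincide, as expected when item difficulties do not vary.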

Analogous comparisons hold for the standard error of measurement. For the binary model, it follows that the estimated correlation between true and observed scores is $\sqrt{\mathrm{KR21}}$ and the estimated standard error of measurement is:

$$s_e = s_X\sqrt{1 - \mathrm{KR21}} \tag{18}$$

It can be easily shown that $s_e$ is the mean of the individual student standard errors of measurement $s_{e_s}$. This quantity will always be greater than its analogue in classical and item sampling models (equation [7] with sample estimates) unless, again, item difficulties are equal.
Discussion 

Thus, excepting the test construction consequences of strict versus randomly parallel items, all three "traditional" models appear, for all practical intents and purposes, to be equivalent when item difficulties are equal (or nearly so). This makes a lot of sense when one teases out the conception of true score in each model. In the general binary error model, the true score is a parameter
of the item population, but each student receives a different randomly sampled set of items. Ordinarily, a student will have different true scores on each of those item samples, but these are not the true scores of interest. Rather, it is the mean of these true scores (the item population true score) that is to be estimated for each student. A similar conception of true score holds for the item sampling model except that each student responds to the same randomly sampled set of items. The classical model is a degenerate form of the item sampling model where all $\pi_i$ are equal. But in the event that items are all of equal difficulty, true scores will be identical in each item sample, and, of course, these are identical to the true score in the population. However, if this is not the case, and students respond to different item samples, more variation can be expected to enter into any summary statistics designed to reflect measurement error.

So where in these "traditional" test theories is the concept of precision as I have defined it? Where do the theories speak to the construction and calibration of the measurement device? Again, the answer is nowhere. I am not, of course, suggesting that items go unrecognized in traditional test theories. However, I am suggesting that the item parameters, for example, in the model specified by (11), are there mostly by default. Moreover, I'm suggesting that precision, which is indeed gained in the composite test score, is serendipitous: items are invariably nonparallel and tests are usually long enough, with sufficient variation in item difficulties, so that total scores are at least positively and monotonically related to the underlying ability continuum. Put slightly differently, I am suggesting that the wrong theoretical framework for conceptualizing the act of measurement has been used to evaluate what turns out to be a fairly common and intuitively sensible approach to the measurement of ability.

Consider this ironic outcome in terms of classical test theory: differences in item difficulties (desirable building blocks for measurement) are evidence for violating the fundamental assumption of parallelism for the internal consistency form of reliability. Moreover, such differences automatically put a ceiling on the maximum level of KR20 (or alpha) due to the ceiling on phi coefficients when marginal proportions are not identical. For these reasons, we all learned that the "best" possible test was one with items of near equal difficulty and, preferably, all at the .5 level to maximize the potential for total score variance, all nice ingredients for norm-referenced applications. Not surprisingly, it is under the "ideal" condition of equal item difficulties that all three traditional test theory models are, for practical intents and purposes, identical.

This "ideal" student-item response pattern highlights the folly of treating items as merely short (the shortest) repeated tests. As implied above, maximum KR20 obtains when items are at the .5 difficulty level and all students either get all items right or all wrong. For a k-item test, then, half the students have a score of k and half have a score of 0. Clearly little information is obtained when only two decisions can be made.
(Latent trait models, which attack the issue of calibrating test items 

JlrJ.cllX»..PAn. A9J^ r£^P^!5 ^^^^^^^ since they have 

no utility in pinpointing locations on the latent continuum.) Equally ironic implications of this "ideal" score matrix occur for validity coefficients. (See Loevinger, 1954.) It is a rather sad commentary that "something fishy" about classical test theory was smelled early on by scholars who continued to propagate the methods:

It may be, if items of graded difficulty levels are
used, that counting one point for each item correct
is not a proper scoring method. The score assigned
should rather be a best estimate of the difficulty
level reached, analogous to that used in the Binet
test.... Another limitation in the theory here
developed should be pointed out. The criterion of
maximizing test variance cannot be pushed to extremes.
Test variance is a maximum if half of the population 
makes zero scores, and the other half makes perfect 
scores. Such a score distribution is not desirable 
for obvious reasons, yet current test theory provides 
no rationale for rejecting such a score distribution. 
Obviously the "best" test score distribution is one 
which accurately reflects the "true" ability distri- 
bution in the group, but there is perhaps little hope 
of obtaining such a distribution by the current pro- 
cedure of assigning a score based upon sheer number 
of correct answers. At present the only solution to 
such difficulties seems to lie in some type of abso- 
lute scaling theory.... (Gulliksen, 1945, pp. 90-91.) 

As a final example of the ironies inherent in classical models, consider the classical test theory notion of a constant standard error of measurement for every possible score. Does it make sense that particular high (or low) scoring students would have the same random error distributions around their true scores as would intermediate scoring students? At a purely intuitive level this doesn't make much sense at all. The binomial error model makes it clear that errors are smaller at the ends of the score distribution and larger towards the center. This makes perfect sense if we think of sampling items as analogous to sampling balls from an urn to achieve accuracy of estimation: blue balls are items answered correctly, red ones are incorrect items, and a student's estimated true score is the proportion of blue balls obtained when selecting k balls at random from the urn.

But it makes no sense if items are conceived as fundamental building 
blocks of the measurement process. In this case, "error" ought to become 
much more associated with the precision of measurement. In fact, the 
error pattern should be the complete reverse of that predicted by the bi- 
nomial model. Errors would be larger toward the extremes of the score dis- 
tribution and smaller towards the center. At the extremes, we know nothing 
about the ability level of persons scoring 0 or k on a k-item test. The 
analogy to physical measurement is again instructive. It is equivalent 
to selecting that bundle of sticks of appropriate length such that they 
can center on the person's height. If the smallest stick is too long 
(a 0-scorer) or the longest stick too short (a 1-scorer), we have failed
to measure the person's height to within the given units of precision. 

In sum, it can be said that classical (and classical-like) test theories are good models for assessing the dependability of measurements whose internal measurement properties are already well understood or at least accepted as given. (Generalizability theory becomes particularly useful in these circumstances, as noted previously.) But they are poor models for directing and assessing the development of item-based measures which, as suggested by the physical measurement analogy, rely upon item difficulties as proxies for calibrations on the "ruler." Again, many achievement tests produce useful results serendipitously for the obvious reason that practitioners of classical testing methods sense the necessity for including items of varying difficulty. But the reasons for the eventual presence or absence of items on their tests are the wrong ones, being rooted in a "theory" of dependability rather than measurement. I will now turn to an illustrative survey of some measurement models which are theoretically oriented in the latter direction.



For lack of a better one, I am using the term cumulative to refer to a rather heterogeneous class of measurement models which explicitly acknowledge the measurement function of items as heretofore discussed. If not already obvious, the descriptive value of this term will be apparent shortly. A potpourri of these models will be presented in just enough detail to highlight how they radically differ from classical (and classical-like) test theories in their conceptual approach to the measurement act. All these cumulative models approach the measurement act directly (using the items-as-sticks notion), relying on item difficulty variance for precision and calibration and the total score (or a function of the total score) as an indicant of the ability being measured.

Before beginning this survey, I wish to note a side benefit to using the "items-as-sticks" notion in developing a measurement rule (i.e., test).
In 1963, a seminal article by Glaser stimulated the so-called criterion-referenced testing movement. Soon thereafter, an important article by Popham and Husek (1969) rightly noted the inappropriateness of norm-oriented classical test theory methods for handling the development and analysis of criterion-referenced tests. The literature virtually exploded with attempts to adapt classical test theory to fit the requirements of criterion-referenced tests. The focus of these efforts was quite misdirected. The fundamental issue was not testing or even purpose of testing; rather, it was an issue of measurement. The proper role of items in a test forces (or should force) the test constructor to match item content with the cognitive processes to be assessed. Assuming a singular construct and a scalable set of k items having different difficulties, k + 1 "mastery" levels can be assessed. "Criterion-referenced testing," therefore, is simply sensible measurement. Of course, following sensible measurement, one can always (a) select a particular mastery level for criterion-referenced decisions or (b) compile group statistics for comparative purposes, thereby developing norm-referenced test interpretations.

Guttman's Scalogram Analysis

David Walker (1931, 1935, 1940), perhaps the first person to recognize the value of the doubly ordered raw score matrix, began a series of investigations on the relationship between response patterns and the resultant shape of score distributions. In the course of this inquiry, Walker conceptualized the ideal response pattern and attempted to index departures from this pattern, a condition he nicknamed "hig" after the term "higgledy-piggledy" to describe the apparent haphazardness in non-ideal response patterns. But his interest centered on implications for test score scatter rather than the more profound implications for measurement itself.

Guttman (1944) reversed this focus and formalized a scaling procedure for assessing the degree to which items conformed to the ideal response pattern. Figure 3a presents an example of an ideal cumulative response pattern for 20 students responding to five items. However, that this is an ideal pattern is not immediately obvious until the score matrix is arranged in rank order on both student scores and item difficulties. One such convenient "double sorting" of the score matrix orders students from highest to lowest scores and items from easiest to most difficult. In Figure 3b we see the cumulative nature of the scoring pattern inherent in the unsorted data as presented in Figure 3a. Figure 4 presents the same score distribution, but this time there are some "errors," i.e., student-item responses which do not fit the ideal pattern. For example, student 8 should have answered item 1 correctly and item 5 incorrectly, thereby contributing two student-item response errors to the total 20 x 5 (i.e., nk) possible student-item responses. Finally, Figure 5 depicts yet again the same score distribution but with many errors, resulting in a very poor cumulative pattern.
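The double sorting described above is a simple reordering of rows and columns. The sketch below is one possible way to carry it out, assuming a students-by-items matrix of 0/1 scores; the function name, the student labels and the tiny four-by-three matrix are hypothetical and are not taken from the figures.

```python
import numpy as np


def double_sort(scores, student_ids, item_ids):
    """Order students from highest to lowest score and items from easiest to hardest."""
    x = np.asarray(scores)
    row_order = np.argsort(-x.sum(axis=1), kind="stable")  # highest total first
    col_order = np.argsort(-x.sum(axis=0), kind="stable")  # most-passed item first
    sorted_x = x[row_order][:, col_order]
    students = [student_ids[i] for i in row_order]
    items = [item_ids[j] for j in col_order]
    return sorted_x, students, items


# Hypothetical unsorted matrix: rows are students A-D, columns are items 2, 3, 1.
raw = [
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 1],
]
sorted_x, students, items = double_sort(raw, ["A", "B", "C", "D"], [2, 3, 1])
print(students, items)
print(sorted_x)
```

After sorting, an ideal cumulative pattern shows up as a solid triangle of ones in the upper left of the matrix, which is hard to see in the unsorted layout.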

To index the degree of cumulativeness present in the pattern, Guttman used a deterministic approach. All deviations (e) from the ideal pattern are errors, i.e., the approach makes no allowance for probable deviations. An obvious index then is the proportion of non-errors in the entire response matrix (1 - e/nk). Guttman named this index the coefficient of reproducibility (REP) insofar as it reflected the extent to which the response pattern could be perfectly reproduced from the student scores or item difficulties. Thus,

$$\mathrm{REP} = 1 - \frac{e}{nk} \tag{19}$$

Figure 3a

Unsorted Cumulative Response Pattern
for a Hypothetical Ideal Score Matrix
[Score matrix: 20 students (rows) by 5 items (columns, in the order 3, 2, 5, 1, 4), with row totals and item proportions correct $p_i$ = .50, .75, .10, .90, .25]

Figure 3b

Sorted Cumulative Response Pattern
for a Hypothetical Ideal Score Matrix
(Rep = 1.00; CS = 1.00; α = .76)

Figure 4

Moderately Cumulative Response Pattern
(Rep = .86; CS = .63; α = .57)

Figure 5

Poor Cumulative Response Pattern
(Rep = .74; CS = .46; α = .49)

But REP can never be smaller than the average of the observed item difficulties ($p_i$) or easinesses ($q_i = 1 - p_i$), whichever are greatest. That is:

$$\mathrm{Min(REP)} = \frac{\sum_i \mathrm{Max}(p_i, q_i)}{k} \tag{20}$$

The degree of improvement (IMP) over minimum reproducibility is, therefore,

$$\mathrm{IMP} = \mathrm{REP} - \mathrm{Min(REP)} \tag{21}$$

Moreover, the maximum possible improvement is

$$\mathrm{Max(IMP)} = 1 - \mathrm{Min(REP)} \tag{22}$$

Thus, a more realistic appraisal of the degree to which items scale, above that expected by the marginal results alone, can be seen in the ratio of IMP to Max(IMP). Denoted the coefficient of scalability (CS) by Menzel (1967), this index can be written as follows:

$$\mathrm{CS} = \frac{\mathrm{REP} - \mathrm{Min(REP)}}{1 - \mathrm{Min(REP)}} \tag{23}$$

It has usually been recommended that reasonable scalability requires REP > .9 and CS > .6. The score matrices in Figures 3a, 4 and 5 depict what are ideally, moderately and weakly cumulative response patterns. These descriptors are clearly reflected in the values of REP and CS accompanying each score matrix.
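The reproducibility and scalability indices can be computed directly from a score matrix. The sketch below makes one assumption that the text does not spell out in general: errors are counted by comparing each student's responses with the ideal pattern for that student's total score (a student with total t passes exactly the t easiest items), which agrees with the two errors attributed to student 8 above. The function name is hypothetical, and the ideal matrix is built from item proportions correct of .90, .75, .50, .25 and .10, the marginals shown in Figure 3a.

```python
import numpy as np


def guttman_indices(scores):
    """Return (REP, Min(REP), CS) for an n-by-k matrix of 0/1 scores."""
    x = np.asarray(scores, dtype=int)
    n, k = x.shape
    p = x.mean(axis=0)
    order = np.argsort(-p, kind="stable")          # easiest item first
    x_sorted = x[:, order]
    totals = x_sorted.sum(axis=1)
    # Assumed ideal pattern: a total of t means passing the t easiest items.
    ideal = (np.arange(k)[None, :] < totals[:, None]).astype(int)
    errors = int((x_sorted != ideal).sum())
    rep = 1.0 - errors / (n * k)
    min_rep = float(np.maximum(p, 1.0 - p).mean())
    cs = (rep - min_rep) / (1.0 - min_rep)
    return rep, min_rep, cs


# Ideal 20-by-5 matrix: numbers of students at each total score 5, 4, ..., 0.
counts = {5: 2, 4: 3, 3: 5, 2: 5, 1: 3, 0: 2}
ideal_matrix = np.array(
    [[1] * t + [0] * (5 - t) for t, c in counts.items() for _ in range(c)]
)
rep_ideal, min_rep_ideal, cs_ideal = guttman_indices(ideal_matrix)

# Hypothetical perturbation: one score-3 student misses the easiest item
# and passes the hardest one instead, giving two errors out of 100 responses.
perturbed = ideal_matrix.copy()
perturbed[5] = [0, 1, 1, 0, 1]
rep_pert, min_rep_pert, cs_pert = guttman_indices(perturbed)
print(rep_ideal, round(min_rep_ideal, 2), round(cs_ideal, 2))
print(rep_pert, round(cs_pert, 2))
```

Other error-counting conventions exist and can give somewhat different REP values for the same data, so this sketch should not be expected to reproduce the coefficients reported with Figures 4 and 5.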

There are probably three basic reasons why Guttman scaling received little favor in the achievement testing arena. First, for reasonably homogeneous objective domains, it is difficult to write achievement items which scale well. In fact, Guttman devised the scalogram procedure for attitude measurement, where it is often easier to write items with distinctly different affective magnitudes (item "difficulties") covering the same essential domain. Second, Guttman made unrealistic claims regarding the power of scalogram analysis to test unidimensionality, thereby opening up the procedure to a barrage of criticism. (See, for example, Festinger, 1947 and Loevinger, 1948.) In line with the discussion of unidimensionality earlier in this monograph, Guttman would have been on firmer ground had he simply suggested that a scalable set of items is necessary but not sufficient evidence that a set of items measures the same thing to within reasonable evidence of content (and/or construct) validity. Third, and probably most critical, the model was deterministic and offered no statistical (i.e., probabilistic) tests of fit. (See Torgerson, 1958.)

But no criticism was ever directed at the most important notion behind Guttman's approach, namely, the measurement role of items as, in essence, calibrations on a "yardstick." The approximation to the ideal pattern (Figure 3b) would most likely be the acknowledged goal of most achievement test constructors. Yet, instead of expending considerable effort in mapping the cognitive consequences of instructional units and writing, testing, modifying and rewriting relevant items that do begin to show nice cumulative properties, test constructors have been content to build tests on the classical test theory principle of redundancy, i.e., repeated measurements to realize reliability (as internal consistency).



As an interesting aside, even the deterministic nature of Guttman scaling was rendered a non-issue by a number of writers. Perhaps the most ingenious approach was based upon Cox's (1954) analysis of covariance model for cumulative repeated measurements (see Maxwell, 1959 and Ten Houten, 1969). Other techniques were investigated by Goodman (1959), Sagi (1959) and Schuessler (1961). The point of this note is simply that attention needs to be redirected towards the underlying principles of measurement and away from the worry of more or less sensitive statistical indicators; not that the latter are unimportant, but that the former are much more so.
Loevinger's Homogeneity Analysis 

In her 1947 monograph, Jane Loevinger delivered what I believe to
be among the best and most provocative critiques of classical test
theory; and she followed up with an equally provocative critique of
item sampling theory in 1965. To be sure, some of Loevinger's
criticisms were a bit overstated, particularly her judgment that the
axioms of classical test theory were circular (see Novick, 1966). But
generally, her view regarding the inappropriateness of treating items
as repeated measurements and her switch in focus from reliability to
constructing cumulative scales represent the fundamental contribution.

Like Guttman's, Loevinger's approach is based upon deviations from
the ideal response pattern. Unlike Rep (and its derivatives), however,
her homogeneity index (H) reflects these discrepancies in terms of
maximum expectations given the difficulty level of the items. Assuming
items are arranged in ascending order of difficulty, then for any two
items i and j the usual four-fold classification table obtains:



                      Item j
                    1       0

    Item i    1     a       b       a+b    p_i = (a+b)/n
              0     c       d       c+d    q_i = (c+d)/n

                   a+c     b+d      n = a+b+c+d

              p_j = (a+c)/n    q_j = (b+d)/n

Here a, b, c, and d are the numbers of students in each of the
respective possible score patterns. Since we have arranged the data
assuming item i is easier than j, a+b must be greater than a+c; in
proportion terms, $p_i > p_j$.

Ideally, no one answering the more difficult item correctly would
answer the easier item incorrectly. The ideal four-fold classification
table would then look like this:

                      Item j
                    1       0

    Item i    1     a       b       a+b
              0     0       d       d

                    a      b+d      n

But in the actual testing process, "errors" do occur and c, the number
of students getting the more difficult item right but the easier item
wrong, is often not zero. These are the deviations from the ideal
scale types in Figures 4 and 5.

Loevinger's index of "homogeneity" focuses just on the outcomes a and
c, that is, on the easier item's scoring pattern for those students
answering the more difficult item correctly (the heavily outlined column,
Item j = 1, in the above schematics). In other words, the index is based
upon the conditional probability $p_{i|j}$ of answering item i correctly
given that item j is answered correctly. In the general case, this
probability is given by the number of students a who answered both items
correctly divided by the total number of students a+c who answered item j
correctly:

$$p_{i|j} = \frac{a}{a+c} = \frac{p_{ij}}{p_j} \tag{24}$$

where $p_{ij}$ is simply the proportional equivalent of a, viz., a/n, which
is the probability of answering both items i and j correctly. In the
ideal case of perfectly homogeneous items (as in Figure 3b), c = 0 and
$p_{i|j} = 1$. In the perfectly heterogeneous case, we would expect items to
function completely independently, i.e., $p_{ij} = p_i p_j$, in which case
$p_{i|j} = p_i$ by (24) above. An index of homogeneity between the two items
i and j can then be formed as follows:



$$H_{ij}
= \frac{\text{observed improvement in } p_{i|j} \text{ over that expected under perfect heterogeneity}}
       {\text{maximum possible such improvement if items were perfectly homogeneous}}
= \frac{p_{i|j} - p_i}{1 - p_i} \tag{25}$$

In form and intent, this coefficient is analogous to the coefficient
of scalability (23) proposed for Guttman scaling. But $H_{ij}$ has a number
of further properties. Among the more interesting is the following:



$$H_{ij} = \frac{\phi_{ij}}{\mathrm{Max}(\phi_{ij})} \tag{26}$$

where $\phi_{ij}$ is the ordinary Pearson product-moment correlation between
two items which, since the items are binary, is also the fourfold point
correlation computed as:

$$\phi_{ij} = \frac{p_{ij} - p_i p_j}{\sqrt{p_i q_i p_j q_j}} \tag{27}$$

But $\phi_{ij}$ cannot reach unity unless the marginals $p_i$ and $p_j$ are
equal, i.e., unless the item difficulties are equal. This is exactly the
circumstance under which the two items are useless for purposes of
precision, i.e., they replicate the same calibration information
rather than add decision points to the scale. And of course this is
exactly the condition most suited for classical test theory, a theory of
accuracy.

However, we can "correct" $\phi_{ij}$ by dividing it by the maximum possible
value it can assume in the case of unequal $p_i$ and $p_j$. That is

$$\mathrm{Max}(\phi_{ij}) = \frac{p_j - p_i p_j}{\sqrt{p_i q_i p_j q_j}} \tag{28}$$

and thus

$$\frac{\phi_{ij}}{\mathrm{Max}(\phi_{ij})} = \frac{p_{ij} - p_i p_j}{p_j - p_i p_j} \tag{29}$$

Upon dividing both numerator and denominator of (29) by $p_j$, the
equivalency given by (26) is verified.
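
The division step can be written out in full. The short derivation below
uses only the definitions already given in (24) through (29); the one added
assumption is that $p_j > 0$ and $p_i < 1$, so that no denominator vanishes.

$$
\frac{\phi_{ij}}{\mathrm{Max}(\phi_{ij})}
= \frac{p_{ij} - p_i p_j}{p_j - p_i p_j}
= \frac{p_{ij}/p_j - p_i}{1 - p_i}
= \frac{p_{i|j} - p_i}{1 - p_i}
= H_{ij}
$$

The item variances under the square root in (27) and (28) cancel in the
first step, which is why $H_{ij}$ depends only on $p_{ij}$, $p_i$ and $p_j$.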

But the result is more than algebraic. The maximum $\phi_{ij}$ is obtained
when all the students answering item j correctly also answer item i
correctly, i.e., when $p_{ij} = p_j$. This, of course, is the ideal cumulative
response pattern shown in the above schematic. Thus,
$\phi_{ij}/\mathrm{Max}(\phi_{ij})$ is really measuring the extent to which
this ideal is obtained and ranges from 0 to 1 accordingly. Unfortunately,
this index suffers a bit from the fact that it can also be 1 in value for
items of equal difficulties when the b cell is also zero. Even in the
extreme case of Figure 6, the overall index ($H_t$) of homogeneity (see
below) is unity. Guttman indices suffer from the same problem. In effect,
the scaling indices being presented here are necessary but not sufficient
indicators of the cumulative nature of the test items. (See footnote 8.)
We must also, therefore, have some indication of item difficulty spread
over the ability range of interest.

[Figure 6: A Degenerate Case: The Perfect Classical Test Response Pattern
(Rep = 1; CS = 1; $\alpha$ = 1)]

To complete the discussion of Loevinger's approach, we note that
a weighted average of $H_{ij}$ can be formed for all item pairs i and j (such
that $p_i > p_j$) yielding an overall index of test homogeneity ($H_t$). The
most straightforward approach to constructing $H_t$ is to reconsider
equation (29), which was formed as a ratio of equations (27) and (28).
Since the item variances in the denominators of (27) and (28) cancelled
out, (29) is, in effect, the ratio of the observed covariance of items
i and j to the maximum possible covariance given the $p_i$ and $p_j$. An
overall index can then be formed as a ratio of the sum of the k(k-1)/2
unique observed covariances to the sum of the corresponding k(k-1)/2
maximum covariances:

$$H_t
= \frac{\sum_i \sum_j (p_{ij} - p_i p_j)}{\sum_i \sum_j (p_j - p_i p_j)}
= \frac{\overline{\mathrm{Cov}}_{ij}}{\overline{\mathrm{Max}(\mathrm{Cov}_{ij})}} \tag{30}$$

where $\mathrm{Cov}_{ij}$ denotes the covariance between items i and j. Some
algebraic manipulation of (30) will verify that it can also be written as

$$H_t = \frac{\sum_i \sum_j p_j q_i H_{ij}}{\sum_i \sum_j p_j q_i} \tag{31}$$

i.e., $H_t$ is a weighted (by $p_j q_i$) average of
$H_{ij} = \phi_{ij}/\mathrm{Max}(\phi_{ij})$. This makes intuitive sense since
$p_j q_i$ is the expected proportion of errors in the completely
heterogeneous (non-cumulative) case.
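
The pairwise and overall indices can be computed directly from a 0/1
response matrix. The following is a minimal sketch, not Loevinger's own
procedure: the function name `loevinger_h` and the list-of-rows input format
are assumptions made for illustration, and the overall index is taken as the
ratio of summed observed covariances to summed maximum covariances, as in (30).

```python
from itertools import combinations


def loevinger_h(responses):
    """Return (pairwise H_ij dict, overall H_t) for a 0/1 response matrix.

    `responses` is a list of rows, one per student, each a list of 0/1
    item scores. Names and input layout are illustrative assumptions.
    """
    n = len(responses)
    k = len(responses[0])
    # Observed item difficulties p_i (proportion answering correctly).
    p = [sum(row[i] for row in responses) / n for i in range(k)]

    pair_h = {}
    sum_cov = 0.0      # numerator of (30): observed covariances
    sum_max_cov = 0.0  # denominator of (30): maximum covariances
    for i, j in combinations(range(k), 2):
        # `easy` plays the role of item i in the text, `hard` of item j.
        easy, hard = (i, j) if p[i] >= p[j] else (j, i)
        p_both = sum(row[easy] * row[hard] for row in responses) / n
        cov = p_both - p[easy] * p[hard]
        max_cov = p[hard] - p[easy] * p[hard]
        if max_cov == 0:
            continue  # an item with p = 0 or p = 1 gives no information
        pair_h[(easy, hard)] = cov / max_cov
        sum_cov += cov
        sum_max_cov += max_cov
    h_t = sum_cov / sum_max_cov if sum_max_cov else float("nan")
    return pair_h, h_t


# A perfect cumulative pattern, and the same pattern with one reversal.
guttman = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
flawed = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
print(loevinger_h(guttman)[1], loevinger_h(flawed)[1])
```

Because the sums run over item pairs rather than items, adding more items
with the same pairwise structure leaves the overall index unchanged.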

It should be clear that $H_t$ is an average inter-item statistic assessing
the degree to which all possible ordered item pairs are homogeneous
(in the cumulative sense) on the average. Thus, it does not increase
merely as a function of increased number of items as does the internal
consistency coefficient $\alpha$ in traditional test theory. This is as it
should be since $H_t$ is intended to index the cumulative structure of items
while $\alpha$ is aimed at assessing the reliability of repeated item
measurements.

Ironically, Horst (1953), capitalizing on the seductively simple
relationship between $H_t$ and the intraclass reliability coefficient of
classical test theory, has proposed "blowing up" $H_t$ by a factor of k using
the Spearman-Brown prophecy formula to correct the ceiling effect problem
of unequal item difficulties in classical test theory. To his credit,
Horst is among the few test theorists who have recognized conceptual
differences between reliability and homogeneity and devoted ample space
to Loevinger's work in his book on measurement theory (Horst, 1966).
But although I can relate to the intended use of the modification
offered by Horst, the modification once again confuses fundamental
measurement issues by commingling the concepts of precision and accuracy.

Consider, first, the specifics of the modification. The intraclass
reliability ($r_{ii}$) in classical test theory is the reliability of the
average single-item test. It can be shown that by adjusting $r_{ii}$ upwards
by a factor of k using the classical Spearman-Brown formula, we end
up with the KR20 (or $\alpha$) formula for reliability at the total test
level. Noting that $r_{ii}$ can be defined as the ratio of the average
inter-item covariance to the average item variance, i.e.,

$$r_{ii} = \frac{\overline{r_{ij} s_i s_j}}{\overline{s_i^2}}
= \frac{\overline{\mathrm{Cov}}_{ij}}{\overline{\mathrm{Var}}_i} \tag{32}$$

the relationship given in equation (10) leads directly to the
Spearman-Brown "correction" as follows:

$$\mathrm{KR20} = \frac{k\, r_{ii}}{1 + (k-1)\, r_{ii}} \tag{33}$$

Now, the maximum possible $r_{ii}$ given the disparities in item difficulties
is

$$\mathrm{Max}(r_{ii}) = \frac{\overline{\mathrm{Max}(\mathrm{Cov}_{ij})}}{\overline{\mathrm{Var}}_i} \tag{34}$$

If we correct $r_{ii}$ in the usual manner, it is obvious that

$$\frac{r_{ii}}{\mathrm{Max}(r_{ii})}
= \frac{\overline{\mathrm{Cov}}_{ij}}{\overline{\mathrm{Max}(\mathrm{Cov}_{ij})}}
= H_t \tag{35}$$

The suggested modification by Horst, therefore, is to substitute the
corrected $r_{ii}$, i.e., $H_t$, in equation (33), thereby making it possible
for KR20 to reach unity even when item difficulties are unequal.

$$\text{Corrected KR20} = \frac{k\, H_t}{1 + (k-1)\, H_t} \tag{36}$$

Consider, second, the implication of this formula. A test can be made
perfectly homogeneous by adding an infinite number of mostly heterogeneous
items so long as they are positively correlated. Now this seems reasonable
for achieving increasingly accurate measurements; but it does not
necessarily lead to increased precision and a more scalable set of items.
Suppose, for example, the test is doubled in length by adding k parallel
items, i.e., items that are equal in difficulty, one-for-one, to those in
the original test and that scale identically to those in the original test.
We now have twice the test information at each ability level but still
the same number of ability levels represented in the test. Suppose,
again, that the new items are equally scalable but have difficulty levels
between those of the original items. We now have the same information
at each ability level but twice the number of ability levels that can
be assessed. Formulas such as (36) "blow up" the index indiscriminately,
thereby conflating the issues of accuracy and precision.
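
The indiscriminate growth is easy to see numerically. The sketch below
applies the Spearman-Brown step-up of (33) and (36) to a fixed overall
homogeneity value while only the test length k changes; the value 0.30 is a
hypothetical figure chosen for illustration, not a result from any data set.

```python
def spearman_brown(value, k):
    """Step an average inter-item index up to a k-item test, as in (33)/(36)."""
    return k * value / (1 + (k - 1) * value)


h_t = 0.30  # hypothetical overall homogeneity, held fixed throughout
for k in (5, 10, 20, 40):
    # The stepped-up index rises with k although h_t itself never changes.
    print(k, round(spearman_brown(h_t, k), 3))
```

The same function reproduces the uncorrected KR20 of (33) when it is given
the intraclass reliability instead of the homogeneity index.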

Horst (1966) makes an effort to distinguish reliability and homogeneity
by noting that reliable items are a necessary but not sufficient
condition for high $H_t$. Thus, high $H_t$ is, in part, a function of
reliability. Now this is true for reliability at the item level. But it is
not true for reliability (as internal consistency) at the test level.
Again, I am trying here to clearly separate the precision obtained through
calibrating a homogeneous or unidimensional test from the accuracy of the
test.

Bentler's Monotonicity Analysis

I include a discussion of Bentler's (1971) approach here primarily
to emphasize that multidimensionality is not an intractable issue when
measurement is conceived and operationalized as a cumulative scaling
process. Thus far I have avoided the issue of empirical dimensionality,
suggesting, instead, that a scalable or homogeneous set of items plus
reasonable evidence of content validity is a necessary but not sufficient
condition for unidimensionality. Although I (and others) often use the
terms unidimensional and homogeneous synonymously, it should be understood
that the former is not an automatic consequence of the latter.

Preferring the term monotonic (instead of cumulative), Bentler
quite cleverly recognized that Yule's Y coefficient (a simple function
of the more familiar Yule's Q coefficient) for association in a four-fold
table (see Yule, 1912) possessed none of the drawbacks of $\phi$ or
$\phi/\phi_{\max}$ when subjected to an ordinary principal components factor
analysis. For any two items i and j, this index, renamed the monotonicity
coefficient by Bentler since he developed it in a more general form,
is given as follows:

$$m = \frac{ad - bc}{ad + bc + 2\sqrt{abcd}} \tag{37}$$

where a, b, c and d are as given in the four-fold table layout in the
previous section. The nice thing about Yule's association measure is
that it becomes 1 (or -1) only when one (or more) cells are empty.
These include exactly those four-fold response patterns of cumulative
scales; and a principal components factor analysis of the inter-item
m-matrix will recover two or more cumulative scales embedded in a set
of items.
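
A short sketch of the coefficient for a single four-fold table follows. It
assumes the cell layout of the previous section (a = both items right, b =
item i only, c = item j only, d = both wrong) and is written so that a
positive value means the concordant cells a and d dominate; the function
name `monotonicity` is hypothetical.

```python
import math


def monotonicity(a, b, c, d):
    """Yule's Y for a four-fold table, in the expanded form of (37).

    Assumed layout: a = both right, b = item i only, c = item j only,
    d = both wrong.
    """
    cross = math.sqrt(a * b * c * d)
    denom = a * d + b * c + 2 * cross
    if denom == 0:
        raise ValueError("m is undefined when both ad and bc are zero")
    return (a * d - b * c) / denom


# An empty c cell (a cumulative pair) gives 1; equal cells give 0.
print(monotonicity(10, 5, 0, 10), monotonicity(5, 5, 5, 5))
```

The expanded form is algebraically the same as the more familiar
$(\sqrt{ad} - \sqrt{bc})/(\sqrt{ad} + \sqrt{bc})$.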

As an index of homogeneity, m is very similar to $H_{ij}$. And, like
Loevinger, Bentler proposes the average of all k(k-1)/2 inter-item
monotonicity coefficients, $\bar{m}$, as an overall measure of inter-item
homogeneity. But then, like Horst, Bentler becomes concerned with the
length of the test not being represented in the index. Thus, he
proposed the same Spearman-Brown transformation of $\bar{m}$ for a final,
overall measure of the test's homogeneity (h),

$$h = \frac{k\,\bar{m}}{1 + (k-1)\,\bar{m}} \tag{38}$$

and, in my view, falls into the same trap of mixing up fundamentally
distinct measurement issues.

Sato's Student-Problem (S-P) Matrix Analysis

Sato (1980) developed yet another means for indexing departures
from the perfect Guttman or cumulative scale. But this time the notion
seems to have caught on. It is difficult to tell at this time whether
it is the novelty of the procedure (and its more sophisticated
mathematical basis) or whether more methodologists have begun to
internalize the need to reconceptualize the proper measurement role of
items. In any case, Sato's contribution reiterates the appropriate focus
for understanding the measurement act, viz., the doubly ordered
student-by-item (problem) matrix of raw responses (e.g., Figures 3b-5).

Interestingly, Sato's approach, unlike those discussed previously,
utilizes a mathematical model of the ideal non-cumulative response
pattern. An index of fit, then, is based on the extent of observed response
pattern departure from the perfectly heterogeneous model. Specifically,
any ordered student-by-problem (item) matrix can be partitioned into
sections corresponding to the expected ideal cumulative patterns based
on either the student scores, the S-curve, or problem scores (item
difficulties), the P-curve.

Figure 7 depicts the process of analyzing the student-problem matrix
in this manner. Figure 7 is simply Figure 4 again, but this time the
cumulative student and problem score distributions are presented
separately and superimposed on the S-P matrix itself. As an exercise,
superimpose the S-curves and P-curves appropriate for the matrices in
Figures 3b and 5. You will discover that in the ideal case (Figure 3b)
the S- and P-curves are coincident; and in the case of a poor cumulative
response pattern (Figure 5), the curves are quite far apart and much more
so than they are for the moderately cumulative pattern exhibited here
(and in Figure 4).

[Figure 7: S-P Matrix and Cumulative Distributions for Student Scores
(S-Curve) and for Problem Scores (P-Curve)]

Thus, the area between the S- and P-curves (proportional to the number of
student-item responses between the curves) reflects the degree of departure
from the ideal cumulative response pattern. (In general, the number of
student-item responses between the S- and P-curves is close to, but is not
functionally related to, the total number of Guttman errors, viz., twice
the number of 0's above, or 1's below, the S-curve.) To construct an index
similar to the coefficient of scalability for Guttman scales, the maximum
possible area between the S- and P-curves must be calculated for the
perfectly heterogeneous student-problem response matrix of the same
dimensions and mean performance. Sato models the ideal heterogeneous matrix
by assuming simple binomial sampling for problems and students. Thus, the
cumulative binomial distributions with parameters k and p and parameters n
and p model the S- and P-curves respectively. Denoting the areas between
the observed and binomial S- and P-curves as $A(n,k,p)$ and $A_B(n,k,p)$
respectively, Sato's disparity coefficient is given as follows:

$$D = \frac{A(n,k,p)}{A_B(n,k,p)} \tag{39}$$

(A more computationally tractable estimate of D is given by Sato, 1980.)
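
The observed area in the numerator can be approximated by counting cells,
since the text describes it as proportional to the number of student-item
responses lying between the two curves. The sketch below does only that: it
doubly orders the matrix, places the S-curve step for each student at that
student's raw score and the P-curve step for each problem at its number of
correct answers, and counts the cells on which the two curves disagree. It
does not compute the binomial reference area or the coefficient in (39), and
the function name is hypothetical.

```python
def sp_between_curve_cells(responses):
    """Count cells lying between the S-curve and the P-curve.

    `responses` is a list of 0/1 rows (students by problems).
    """
    # Doubly order the matrix: students by descending raw score,
    # problems by descending number of correct answers.
    rows = sorted(responses, key=sum, reverse=True)
    k = len(rows[0])
    col_order = sorted(range(k), key=lambda i: sum(r[i] for r in rows),
                       reverse=True)
    rows = [[r[i] for i in col_order] for r in rows]

    scores = [sum(r) for r in rows]                        # S-curve steps
    counts = [sum(r[i] for r in rows) for i in range(k)]   # P-curve steps

    between = 0
    for s in range(len(rows)):
        for i in range(k):
            inside_s = i < scores[s]   # cell lies left of the S-curve
            inside_p = s < counts[i]   # cell lies above the P-curve
            if inside_s != inside_p:
                between += 1
    return between


ideal = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
mixed = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
print(sp_between_curve_cells(ideal), sp_between_curve_cells(mixed))
```

For the perfect cumulative matrix the two curves coincide and the count is
zero, which matches the exercise suggested for Figure 3b.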






The disparity coefficient reaches 1 in the case of perfect heterogeneity
and 0 in the case of a perfect cumulative (homogeneous) response pattern. It
therefore varies inversely (and I expect quite highly) with the other
indices of homogeneity discussed in this section. Moreover, Sato (1980)
defines analogous coefficients at the individual student and problem
levels (called caution indices) which serve to highlight those students
and items which depart considerably from ideal expectations. Loevinger
(1947) developed a similar index for items whereas Guttman relied
exclusively on visual inspection of the response matrix. In the final
analysis, the increasing popularity of Sato's approach is most likely due
to the emphasis placed on the raw score matrix, with handy indices (for
spotting aberrant cases) of great practical utility for the ordinary
classroom teacher. For recent developments in the U.S., see Tatsuoka
(1978), McArthur (1981), Harnisch and Linn (1981), and Miller (1981). (See
also the chapter by McArthur in this monograph.)

Rasch Measurement: A Latent Trait Model 

Latent trait theory, or item response theory (Lord, 1980), refers
to a whole class of statistical measurement models based on the same
fundamental conception of the measurement act guiding the cumulative models
surveyed thus far. However, latent trait models make important allowances
for those "minor" points we glossed over while drawing the analogy to the
physical sciences. Specifically, these were the points relating to the
variability of both the item difficulty positions as "hash marks" on
the "ruler" and the underlying ability continuum itself, as one moves
from one "ruler" to the next. For our purposes here, we will review
only the simplest of the latent trait models, viz., the 1-parameter
model, developed three decades ago by Georg Rasch. A number of good
presentations and/or reviews of latent trait models generally, and the
Rasch model in particular, currently exist. Some examples are: Rasch
(1980 reprint of 1960 edition); Wright and Stone (1979); Hambleton and
Cook (1977; see that entire issue of the Journal of Educational
Measurement); Lord (1980); and Traub and Wolfe (1981).

The Rasch model (and latent trait models generally) assumes a single
invariant ability parameter and specifies a probability function over
the entire 0-1 range that any item will be answered correctly by students
of a given ability. Specifically, Rasch first approached the problem
by imagining independent person and item parameters reflecting,
respectively, ability and difficulty (or, its reciprocal, easiness). Second,
he envisioned the same cumulative response pattern as the ideal outcome
when persons with varying abilities encounter items of varying difficulties.
But he modeled the process probabilistically, not only to avoid the
determinism of previous approaches, but to establish an invariant measurement
scale, so long as the model fits the empirical reality of the test data
in question.

The model he selected is a simple odds ratio, i.e., the odds ($O_{si}$)
of student s with ability $A_s$ correctly answering item i with difficulty
$D_i$ are given as

$$O_{si} = \frac{A_s}{D_i} \tag{40}$$

Instead of odds, we can use the more convenient 0-1 scale of probability.
If $P_{si}$ is the probability of student s answering item i correctly, then,
by definition, $P_{si} = O_{si}/(1 + O_{si})$. Thus equation (40) can be
rewritten as

$$P_{si} = \frac{A_s}{A_s + D_i} \tag{41}$$

It should be clear that, as hypothesized, the model predicts a lower chance
of success for a student with lower ability encountering a relatively more
difficult item, a higher chance of success for a student of higher ability
encountering a relatively less difficult item, and a 50-50 chance of success
when the ability of the student and the difficulty of the item are identical.
These are invariant properties of the person and the item and are presumed
to be independent of each other as well as of the other abilities of the
persons being measured and the other difficulties of items doing the
measuring. Again, this specific objectivity (as Rasch calls it) is operational
only to the extent that these presumptions fit the reality of the data.

Equation (40) becomes computationally more tractable as a simple
linear function by taking the logarithm of both sides, i.e.,

$$\log(O_{si}) = \log(A_s) - \log(D_i) \tag{42}$$

Likewise, equation (41) can be so converted; but it is usually expressed
in exponential form using the natural base e and the substituted parameters
$\alpha_s = \log(A_s)$ and $\delta_i = \log(D_i)$. In other words,
$e^{\alpha_s} = A_s$ and $e^{\delta_i} = D_i$, and equation (41) becomes the
so-called logistic function

$$P_{si} = \frac{e^{\alpha_s - \delta_i}}{1 + e^{\alpha_s - \delta_i}} \tag{43}$$

Of course, the same logic is embedded in (43) as was in (41), except now
the interplay of person encountering item is reflected in the difference
between the transformed ability parameter $\alpha_s$ and difficulty parameter
$\delta_i$. When equation (43) is graphed for all possible values of this
difference, i.e., for $\alpha_s - \delta_i$ against $P_{si}$, the so-called
response characteristic curve results (see Figure 8). This represents the
simplest logistic model, often called the 1-parameter model, since $P_{si}$
is really only dependent upon the single discrepancy $\alpha_s - \delta_i$.
Alternatively, for fixed difficulties $\delta_i$ or abilities $\alpha_s$, the
ogive in Figure 8 represents equally well the item characteristic or person
characteristic curves respectively.
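
The odds form and the logistic form of the model give the same
probabilities, which a few lines of code can confirm. This is a minimal
sketch; the function names are hypothetical, and the logistic function is
written as 1/(1 + e^(-x)), which is algebraically identical to the
e^x/(1 + e^x) form used for the 1-parameter model.

```python
import math


def rasch_probability(alpha, delta):
    """P(correct) under the 1-parameter logistic model.

    Depends only on the difference alpha - delta.
    """
    return 1.0 / (1.0 + math.exp(-(alpha - delta)))


def rasch_probability_from_odds(ability, difficulty):
    """The same probability from untransformed A and D, as in (41)."""
    return ability / (ability + difficulty)


alpha, delta = 1.2, 0.4
print(rasch_probability(alpha, delta))
print(rasch_probability_from_odds(math.exp(alpha), math.exp(delta)))
print(rasch_probability(0.7, 0.7))  # ability equals difficulty
```

Shifting both parameters by the same constant leaves the probability
unchanged, which is why only the discrepancy matters.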

The rather elegant simplicity of the Rasch technique for scaling is
realized through this important property of the model: the student raw
scores ($r_s$) and observed item difficulties ($p_i$) are sufficient data from
which to derive the best estimates of $\alpha_s$ and $\delta_i$ respectively. In
effect, the double ordering of the student-by-item raw score matrix best
estimates the ordering that would occur were we to know the actual
$\alpha_s$ and $\delta_i$. Thus, persons with the same raw score r from the same
set of items will receive the same ability estimate $\alpha_r$.

[Figure 8: Item/Person Characteristic Curve]

To estimate $\alpha$ and $\delta$, therefore, the n x k raw score matrix is
merely collapsed row-wise such that rows now constitute the k+1 possible raw
scores and cell entries are the proportions of persons in the rth raw score
group correctly answering the ith item. If the index r is substituted for the
index s in equation (43), it should be clear from the above property that
these cell proportions ($\hat{P}_{ri}$) are all estimates of their
corresponding $P_{ri}$. In general, then, there are k(k+1) equations of the form

$$P_{ri} = \frac{e^{\alpha_r - \delta_i}}{1 + e^{\alpha_r - \delta_i}}$$

with only 2k+1 unknown values of the $\alpha_r$ and $\delta_i$. (In practice, no
information is provided by raw score classes r = 0 or k or by observed item
difficulties p = 0 or 1 and these rows and/or columns, should they occur, are
eliminated for purposes of analysis.)
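
The row-wise collapse described above is simple to carry out. The sketch
below groups students by raw score and returns, for each group, the
proportion answering each item correctly; it drops the r = 0 and r = k
groups as the text suggests. The function name and the dictionary output are
assumptions for illustration, and no parameter estimation is attempted.

```python
def collapse_by_raw_score(responses):
    """Collapse an n x k 0/1 matrix into proportions per raw-score group.

    Returns {r: [proportion correct on item 1, ..., item k]} for each
    observed raw score r with 0 < r < k.
    """
    k = len(responses[0])
    groups = {}
    for row in responses:
        groups.setdefault(sum(row), []).append(row)

    table = {}
    for r, rows in sorted(groups.items()):
        if r == 0 or r == k:
            continue  # zero and perfect scores carry no information
        table[r] = [sum(row[i] for row in rows) / len(rows) for i in range(k)]
    return table


data = [[1, 1, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
for r, proportions in collapse_by_raw_score(data).items():
    print(r, proportions)
```

Each row of the collapsed table sums to its raw score r, which is a quick
check that the grouping was done correctly.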

There are several approaches to the solution of these equations and
testing the fit of the results to what the model predicts. (See references
noted previously.) The important point for our argument here, however, is
that this model again conforms to the measurement of a property as we
ordinarily conceive of it. Moreover, when this particular model fits the data
reasonably well, the parameter estimates of $\alpha$ and $\delta$ are reasonably
independent of the particular ability and difficulty levels of specific
student and item samples, thereby providing viable approaches to normally
thorny testing problems such as test equating, item banking, tailored
testing, and so forth.

Finally, it is interesting to note that for each person's ability estimate,
there exists a so-called standard error estimate. But the only thing this
estimate has in common with the standard error in traditional test theories
is its name. The latent trait standard error is really based upon an
information function that reflects the level of precision at the various
ability calibrations. It bears no relationship whatsoever to any notion of
item/test replication, i.e., accuracy (or dependability). Thus, the latent
trait standard error is an index of precision and behaves accordingly, i.e.,
it is larger for ability estimates towards the extremes and lower for
ability estimates towards the center of the item difficulty range.
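
The behavior of this standard error can be illustrated numerically. The
sketch below relies on a standard result that is not derived in this
chapter: under the Rasch model the test information at a given ability is
the sum over items of P(1 - P), and the asymptotic standard error is the
reciprocal of its square root. The item difficulties used are hypothetical.

```python
import math


def rasch_standard_error(alpha, deltas):
    """Asymptotic standard error of an ability estimate at `alpha`.

    Assumes the Rasch model, with test information sum of P * (1 - P)
    over the item difficulties in `deltas`.
    """
    info = 0.0
    for delta in deltas:
        p = 1.0 / (1.0 + math.exp(-(alpha - delta)))
        info += p * (1.0 - p)  # information contributed by this item
    return 1.0 / math.sqrt(info)


deltas = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical item calibrations
for alpha in (-3.0, 0.0, 3.0):
    print(alpha, round(rasch_standard_error(alpha, deltas), 3))
```

Adding items with difficulties near a given ability lowers the standard
error there, whether or not those items replicate existing ones.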

Summary 

To summarize the foregoing view and review, test theoreticians and 
practitioners must carefully distinguish their model of measurement from their 
model of the dependability of measurements. The former refers to the con- 
cept of precision that is applied in the construction of tests. The 
latter refers to the concept of accuracy that is applied to the result 
of testing under specified conditions of use. Items play a central role 
in measurement models; in models for dependability, they are of incidental
importance insofar as the accuracy of estimated ability measurements is 
of primary importance. Clearly, truly useful test theories necessarily 
require both measurement and dependability models. 

Classical (and classical-like) test theories are really models for
the dependability of measurements. They are good for assessing the
accuracy of the results of a testing process when the process is conceived
as one (of several) of a great many (often infinite) measurement attempts.
When each of the repeated measurements is conceived as a replicate
(perfectly parallel) measure, we have classical test theory as originally
developed. When the measurements are conceived as a random sample from
a domain of interest (i.e., randomly parallel measures), we have the
item sampling versions of classical test theory. At the core of all of
these theories, however, is the concept of repeated measurements.
Whenever the results of behavioral assessments can be so conceived,
classical test theories, in particular generalizability theory, enjoy a wide
range of application. (See the recent review by Shavelson and Webb, 1981.)

But these test theories "dig their own grave" when they attempt to
translate repeated measurements concepts to the internal structure of
the test itself. Recasting items into the role of strictly parallel (or
randomly parallel) measurements can't help but give rise to "test
construction" procedures based on maximizing inter-item relationships. This
procedure automatically eliminates items reflecting ability at the upper and
lower ends of the "ruler." Thus, empirical evidence for internal
consistency (in the reliability sense) or homogeneity/unidimensionality (in
the construct validity sense) is based upon the wrong covariance structure.

In contrast, measurement models attack the issue of test construction
directly. They assume a singular construct from the start (relying
primarily upon content validation) and proceed to develop items of varying
difficulties analogous to hash marks on a ruler. To the extent that the
set of items fits the cumulative response pattern expectation, we have
evidence (necessary, but not sufficient) that our measurement goal has
been achieved. Once satisfactorily constructed, it is quite appropriate
that the instrument be subject to all relevant forms of dependability and
validity procedures under the conditions for use in actual practice.
These several ingredients comprise a complete test theory.



Moreover, it should be possible to incorporate dependability at the item
level as well. The schematic in Figure 9 portrays the data box necessary
to sort out, at least in theory, the contrasts between test precision
and both item and test accuracy. Vertical slices of the data box contain
the data necessary to assess the accuracy of items at each difficulty level
for all ability levels. Horizontal slices contain the data necessary to
test the scalability of items representing the difficulty levels for each
replication. Cross slices could be used to assess the accuracy of items
at the various difficulty levels holding ability constant. Collapsing the
data box along the difficulty dimension produces the data matrix necessary
for assessing accuracy at the test level. Of course, generalizability facets
could be crossed or nested with the repeated measurement trials to assess
accuracy (dependability) under different conditions. The complete empirical
suggestion of Figure 9 may be quite intractable from an operational
viewpoint, although, for some highly specifiable item domains (e.g.,
arithmetic fundamentals) on which ability varies systematically with other
measurable examinee characteristics (e.g., age), it may not be too
far-fetched.

In conclusion, classical test theory has probably enjoyed a long life
not only because of psychological well-being through cognitive dissonance
reduction, but because tests have never really been developed without
variation in item difficulties. It is time now that we construct tests with
varying item difficulties by design, not by happenstance, and use item
analysis techniques that correspond to an appropriate theory of measurement.
Moreover, it is fitting that this view forces upon us an issue of perhaps
even greater importance, namely, the correspondence of item structure with
the cognitive process to be assessed. (See, for example, the arguments
recently advanced by Glaser, 1981.) It may well be that the simplistic
notions of dichotomous responses (right-wrong) to multiple choice or
true-false items are unrealistic indicators of the cognitive processes
underlying the abilities we try to measure. Different measurement models
from those outlined here may offer more realistic solutions. (For example,
see the recent latent class approaches such as Wilcox's (1981)
answer-until-correct scheme.)

[Figure 9: A Model for Contrasting Accuracy with Precision and Calibrating
a Test of a Singular Achievement Construct]

Footnotes



1. I will use the term "traditional" to refer to classical and classical- 
like test theories, a distinction that will be clearer in the sequel. 

2. I have chosen Spearman's (1910) work, apparently inspired in 1908 by 
G. Udny Yule (see Yule, 1922), to mark the beginning date for classi- 
cal test theory. 

3. It is important to note at the outset that I do not intend to extol
any one notion of what it means to measure achievement. Rather, I 
wish to explicate a popular intuitive notion of measurement and the 
extent to which it is compatible with existing measurement theories. 

4. In general, I prefer the term "dependability" to the older term
"reliability." As used in generalizability theory (Cronbach, et al.,
1972), dependability denotes reliability under specified conditions
of use. At times throughout this report, however, I will use the
term "reliability" to facilitate the discussion of traditional test
theory concepts.

5. I am using the term "difficulty" here more in a parametric sense
than as a synonym for observed p-values.

6. The analogy could be improved upon in this regard by imagining the
sticks to be subject to increases or decreases in length as a function
of various and sundry effects (some random and some systematic) due
to all aspects of the measurement context. This is a less sadistic
equivalent of Lumsden's (1976) flogging wall test.

7. Two classical test theory frameworks are in general use. One arises
out of the definition of error as proposed originally by Spearman
(1930). The other arises out of a definition of true scores as
proposed originally by Brown (1910) and elaborated by Kelley (1924).
The former approach is presented here since it is simpler. All
derivations end up being the same, so it is a purely academic matter
which approach is "better." See Gulliksen's (1950) seminal volume
on classical test theory and the good historical overview by Tryon
(1957).

8. An important caveat should be stated here: Except for the latent
trait models, the illustrations I have selected do not in and of
themselves provide sufficient information for calibrating items
and estimating precision. Nevertheless, they are useful both
historically and heuristically for underscoring the point of this
discussion, viz., the contrast between dependability and measurement.

9. I am using the phrase "criterion-referenced testing" in the more
profound sense rather than simply as a procedure for assessing a
criterion level of performance. The criterion is, rather, the content
and the attempted isomorphism between the content and the measurement
rule. To quote Glaser (1963): "Criterion-referenced measures
indicate the content of the behavioral repertory, and the correspondence
between what an individual does and the underlying continuum of
achievement." (p. 520)

10. Although useful for expository purposes here, this is not ideally the
best procedure for estimating α and δ. (See the chapter by Choppin
in this monograph.)

ANALYSIS OF PATTERNS: THE S-P TECHNIQUE

David McArthur
Center for the Study of Evaluation, UCLA 

Definition of the model. A system of analyzing patterns of
student responses called Student-Problem score table analysis has been
developed over the last decade by a group of educational researchers
in Japan (Sato, 1974, 1975, 1980, 1981a, 1981b; Sato and Kurata, 1977;
Kurata and Sato, 1981; Sato, Takeya, Kurata, Morimoto and Chimura,
1981). While the mathematics associated with derivative indices in
this system are relatively complex, the S-P system itself is
predicated on a simple reconfiguring of test scores. Rather similar
analyses of student performance on educational tests can be found in
the professional literature of a half-century ago, but recent
developments by Sato and colleagues represent significant improvements
both in concept and execution. The method appears to hold a number of
possibilities for effective and unambiguous analysis of test score
patterns across subjects within a classroom, items within a test, and,
by extension, separate groups of respondents. It is a versatile
contribution to the field of testing, with minimal requirements
for sample size, prior scoring, item scaling, and the like. The S-P
model lends itself to extensions into polychotomous scoring, analysis
of multiple patterns, and analysis of patterns of item bias.

Test scores are placed in a matrix in which rows represent
individual respondents' responses to a set of items, and columns
represent the responses given by a group of respondents to a single
item. The usual (and most convenient) entries in this matrix are
zeros for wrong answers and ones for correct answers. Total correct
scores are calculated for each respondent, and total numbers of correct
responses are tallied for each item. Rows are reordered by descending
total number of correct responses; columns are reordered by ascending
order of difficulty of items. The resulting matrix has several
aspects which are particularly convenient for a detailed appraisal of
respondents or items, singly or collectively. A short example,
annotated and indexed with several computations to be explained below,
is shown in Figure 1.

Figure 1

S-P Chart for a Six Item Test Administered to 20 Students

[Chart not reproduced. Items in ascending order of difficulty:
rank 1 2 3 4 5 6; item # 1 5 4 2 3 6.
Average passing rate p = .425; discrepancy D* = .525]
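
The double reordering described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the
program used by Sato and colleagues: the function name sp_sort, the toy
data, and the rule that tied totals keep their original relative order
are all choices made for this example.

```python
def sp_sort(scores):
    """Reorder a 0/1 score matrix into S-P chart layout.

    Rows (respondents) are sorted by descending total correct; columns
    (items) by descending number of correct responses, which is the same
    as ascending difficulty. Ties keep their original relative order,
    an arbitrary but reproducible choice.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    row_totals = [sum(row) for row in scores]
    col_totals = [sum(scores[i][j] for i in range(n_rows)) for j in range(n_cols)]
    row_order = sorted(range(n_rows), key=lambda i: -row_totals[i])
    col_order = sorted(range(n_cols), key=lambda j: -col_totals[j])
    chart = [[scores[i][j] for j in col_order] for i in row_order]
    return chart, row_order, col_order


# Hypothetical data: 4 respondents (rows) by 3 items (columns).
raw = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]
chart, row_order, col_order = sp_sort(raw)
for row in chart:
    print(row)
```

The returned row_order and col_order record which original respondent
and item ended up in each position, so the sorted chart can always be
traced back to the raw data.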

Two cumulative ogives are drawn over the matrix to form the
framework for further analysis. Because the data is discrete, the
ogives take on a stair-step appearance, but both can be thought of as
approximations to curves which describe in summary form the two
distinct patterns embedded in the data. The first is a curve
reflecting respondents' performance as shown by their total scores;
the second is a similarly overlaid ogive curve reflecting item
difficulties. In one special circumstance the two curves describe
only one pattern: if the matrix of items and respondents is perfectly
matched in the sense of a Guttman scale, the two curves overlap
exactly. All of the correct responses would be to the upper left
while all of the incorrect responses would be to the lower right.
However, as the occurrence of either unanticipated errors by
respondents with high scores or unanticipated successes by respondents
with low scores increases, or as the pattern of responses becomes
increasingly random, the respondent or student curve (S-curve) and the
item or problem curve (P-curve) become increasingly discrepant. Sato
has developed an index which evaluates the degree of discrepancy or
lack of conformation between the S- and P-curves. This index will be
zero in the special case of perfectly ordered sets, and will approach
1.0 for the case of totally random data.
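
A small Python sketch may help fix the idea of the two stair-step
curves. It assumes the matrix has already been sorted into S-P layout,
takes the S-curve step in each row to fall after that respondent's
total score, and takes the P-curve step in each column to fall after
that item's number of correct responses. Counting the cells that lie
between the two curves is used here as a simple stand-in for the area
between them; it is an assumption of this example and is not Sato's
discrepancy index, which also requires a reference value for random
data.

```python
def sp_curves(chart):
    """Return step positions of the S-curve and P-curve.

    chart must be a 0/1 matrix already sorted into S-P layout.
    """
    n_rows, n_cols = len(chart), len(chart[0])
    s_curve = [sum(row) for row in chart]  # one step per respondent
    p_curve = [sum(chart[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return s_curve, p_curve


def cells_between(s_curve, p_curve):
    """Count cells that fall on opposite sides of the two curves."""
    return sum(
        (j < s_curve[i]) != (i < p_curve[j])
        for i in range(len(s_curve))
        for j in range(len(p_curve))
    )


# Hypothetical sorted charts: a perfect Guttman pattern and a noisier one.
guttman = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
noisy = [[1, 1, 0], [1, 0, 1], [0, 1, 0], [1, 0, 0]]

gap_guttman = cells_between(*sp_curves(guttman))
gap_noisy = cells_between(*sp_curves(noisy))
print(gap_guttman, gap_noisy)
```

For the Guttman pattern the two curves coincide and no cell lies
between them; any departure from that pattern opens a gap.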

For any respondent, or for any item, taken individually, the
pattern of scores reflects that row or column in relation to the
pattern established by the configuration of sorted rows and columns.
For any given individual respondent or single item, the response
pattern may be "perfectly ordered" in the sense used above. The row
or column shares a symmetry with the associated row or column
marginal; in the case of dichotomous data this symmetry is seen in a
high positive point-biserial correlation. As the match between
patterns declines--that is, as the row or column under consideration
shares less and less in common with the associated marginal formed
from all rows or all columns--the point-biserial also declines.
Unfortunately, the point-biserial is not independent of the
proportions within the data and never reaches 1.0 in practice. Cases
of complete "symmetry" between row or column and the corresponding
marginal which happen to differ in proportions do not yield the same
correlation coefficients.
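
The dependence of the point-biserial on proportions can be seen in a
small numerical sketch. The item totals and the two response rows below
are invented for illustration; both rows are perfectly ordered with
respect to the item totals, yet they differ in the proportion of
correct answers. The correlation is computed as an ordinary Pearson
coefficient between the 0/1 row and the item totals, which is what the
point-biserial amounts to for dichotomous data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical item totals, easiest item first.
item_totals = [9, 7, 5, 3, 1]

# Two perfectly ordered rows with different proportions correct.
row_a = [1, 1, 1, 1, 0]  # 4 of 5 correct
row_b = [1, 1, 0, 0, 0]  # 2 of 5 correct

r_a = pearson(row_a, item_totals)
r_b = pearson(row_b, item_totals)
print(round(r_a, 3), round(r_b, 3))
```

Neither coefficient reaches 1.0, and the two differ even though both
rows are in complete "symmetry" with the marginal.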

An index which is stable across differing proportions is Sato's
Caution Index C, which gives a value of 0 in the condition of "perfect
symmetry" between row or column and row marginal or column marginal.
As unanticipated successes or failures increase and "symmetry"
declines, the index increases (a modification of the Caution Index,
called C*, has an upper bound of 1.0). Thus a very high index value
is associated with a respondent or item for which the pattern of
obtained responses is very discrepant from the overall pattern
established by all members of the set.
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Harnisch and Linn (1982) present the modified Caution Index as 
follows:

\[
C_i^{*} =
\frac{\sum_{j=1}^{n_{i.}} (1 - u_{ij})\, n_{.j} \;-\; \sum_{j=n_{i.}+1}^{J} u_{ij}\, n_{.j}}
     {\sum_{j=1}^{n_{i.}} n_{.j} \;-\; \sum_{j=J+1-n_{i.}}^{J} n_{.j}}
\]

where, with the items arranged from easiest to most difficult,

i = 1, 2, ..., I indexes the examinee,
j = 1, 2, ..., J indexes the item,

u_ij = 1 if respondent i answers item j correctly,
       0 if respondent i answers item j incorrectly,

n_i. = total correct for the ith respondent, and

n_.j = total number of correct responses to the jth item.
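
A minimal Python sketch of the modified Caution Index follows. It
assumes the standard form of the index: for a respondent with total
score n_i., the numerator weights failures on the n_i. easiest items
and successes on the remaining items by the item totals n_.j, and the
denominator is the difference between the summed totals of the n_i.
easiest and the n_i. hardest items. The function name, the toy data,
and the convention of returning 0 when the denominator vanishes are
assumptions of this example.

```python
def modified_caution_index(scores, i):
    """Modified Caution Index C* for respondent i of a 0/1 score matrix."""
    n_items = len(scores[0])
    # n_.j: number of correct responses to each item.
    item_totals = [sum(row[j] for row in scores) for j in range(n_items)]
    # Arrange items from easiest to hardest (ties keep original order).
    order = sorted(range(n_items), key=lambda j: -item_totals[j])
    u = [scores[i][j] for j in order]
    n = [item_totals[j] for j in order]
    k = sum(u)  # n_i.: total correct for respondent i

    # Weighted failures on the k easiest items, minus weighted
    # successes on the remaining, harder items.
    numerator = (sum((1 - u[j]) * n[j] for j in range(k))
                 - sum(u[j] * n[j] for j in range(k, n_items)))
    # Summed totals of the k easiest items minus those of the k hardest.
    denominator = sum(n[:k]) - sum(n[n_items - k:])
    # Zero and perfect scores give 0/0; report the index as 0 there.
    return numerator / denominator if denominator else 0.0


# Hypothetical data: 5 respondents by 4 items.
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
indices = [modified_caution_index(scores, i) for i in range(len(scores))]
print([round(c, 3) for c in indices])
```

In this toy matrix the last respondent answers only the hardest item
correctly, the reverse of the Guttman pattern, and so receives the
maximum value of 1.0.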

Harnisch and Linn explain that the name of the index comes from
the notion that a large value is associated with respondents that have
unusual response patterns. It suggests that some caution may be
needed in interpreting a total correct score for these individuals.
An unusual response pattern may result from guessing, carelessness,
high anxiety, an unusual instructional history or other experiential
set, a localized misunderstanding that influences responses to a
subset of items, or copying a neighbor's answers to certain questions.

A large value may also suggest that some individuals have
acquired skills in an order which is not characteristic of the whole
group. The index says nothing about the most able respondents with
perfect total scores, because the "symmetry" condition is met. More
importantly, if a respondent gets no item correct whatsoever, both the
total score and the caution index will be zero since, again, the
"symmetry" condition is met; in this situation the available
information about the respondent is insufficient to make any useful
diagnosis. Most persons, though, will achieve total scores between
the extremes and for them the caution index provides information that
is not contained in the total score. A large value of the caution
index raises doubts about the validity of the usual interpretation of
the total score for an individual.

A related development is a modification of the Caution Index to
examine patterns of responses to clusters or subtest scores against an
"ideal" pattern of scores on individual subtests, the perfect Guttman
pattern (Fujita and Nagaoka, 1974, in Sato, 1981).

Sato has developed an index of discrepancy to evaluate the degree
to which the S and P curves do not conform either to one another or to
the Guttman scale. Except in the case of perfectly ordered sets there
is always some degree of discrepancy between curves. The index is
explained as follows:

\[ D^{*} = \frac{A(I,J,p)}{A_B(I,J,p)} \]

where the numerator is the area between the S curve and the P
curve in the given S-P chart for a group of I students who took a
J-problem test and got an average problem-passing rate p, and
A_B(I,J,p) is the area between the two curves as modeled by
cumulative binomial distributions with parameters I, J, and p,
respectively (Sato, 1980, p. 15; indices rewritten for
consistency with notation of Harnisch & Linn).

The denominator is a function which expresses a truly random
pattern of responses for a test with a given number of subjects, given
number of items, and given average passing rate, while the numerator
reflects the obtained pattern for that test. As the value of this
ratio approaches 1.0, it portrays an increasingly random pattern of
responses. For the perfect Guttman scale, the numerator will be 0 and
thus D* will be 0. The computation of D* is functionally derived from
a model of random responses, but its exact mathematical properties
have not been investigated thoroughly.

Also available, but not yet studied in detail, is an index of
"entropy" associated with distributions of total scores for students
choosing different answers to the same question. This index explores
the particular pattern of responses (right answer and all distractors
included), in the context of overall correct score totals for these
responses.

While most of the published work using the S-P method has 
concentrated on binary data (0 for wrong answer, 1 for right answer), 
and calculations are most tractable in that form, the indices 
developed from the configuration of S- and P-curves are not limited to 
such data. The technique can be extended to multi-level scoring (see
Possible extensions of the model, below).

Measurement philosophy. A precursor to the S-P method is the
concept of "higgledy-piggledy" (or "hig" for short) suggested by
Thompson about 1930 and elaborated by Walker in a trio of contributions
(1931, 1935, 1940), but evidently carried no further by educational
researchers at that time. Walker examined right/wrong answers to a
set of independent items with particular reference to score-scatter,
which had been a focus of attention since the early twenties. Where
scatter reflects random behaviors on the part of examinees, "hig" is
said to be present. However,

By a test being unig (the converse of hig) we mean that each
score x is composed of correct answers to the x easiest questions,
and therefore to no other questions. Hig implies a departure
from this composition. Note that it is not sufficient for our
purposes to define unig by stipulating that every score is
identical in composition--there must be added the condition that
it is composed of the x easiest items; in other words the score
x + 1 always comprises the x items of the score x, and one more.
Now if hig is absent, that is each score is unig, it is easy to
show that an exact relationship exists between the n's of the
answer-pattern and the n's of the score scatter (1931, p. 75).
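
Walker's condition can be restated as a simple check, sketched here in
Python. The function name is_unig and the toy matrices are assumptions
of this example; "easiest" is taken to mean the items with the largest
numbers of correct responses in the data at hand.

```python
def is_unig(scores):
    """Return True if every score x consists of the x easiest items."""
    n_items = len(scores[0])
    item_totals = [sum(row[j] for row in scores) for j in range(n_items)]
    # Item indices ordered from easiest (most often correct) to hardest.
    easiest_first = sorted(range(n_items), key=lambda j: -item_totals[j])
    for row in scores:
        x = sum(row)
        # All x correct answers must fall on the x easiest items.
        if any(row[j] != 1 for j in easiest_first[:x]):
            return False
    return True


# Hypothetical examples: a perfectly ordered set and one with "hig".
ordered = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
scattered = [[1, 1, 0], [1, 0, 1], [0, 1, 0], [1, 0, 0]]
print(is_unig(ordered), is_unig(scattered))
```

A matrix that passes this check is a perfect Guttman scale in the
sense used earlier; any failure signals the presence of hig.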

The parallel to Guttman scaling, while the latter is far more
mathematically rigorous, is obvious; Sato's indices appear to address
the same underlying concepts.

Guttman's (1944) statistical model for the analysis of
attitudinal data was formulated to solve scaling problems in the
context of morale assessment for the U.S. Army. While the initial
approaches were not at all technically sophisticated and involved much
sorting of paper by hand, Guttman's conceptualization was powerful;
the scalogram approach, and especially its mathematical underpinnings,
received extensive development during the 1950's. But by 1959,
Maxwell had expressed rather strong disappointment with the narrow
range of application these procedures had enjoyed, and suggested two
general statistics which might serve to dissolve the arbitrary
distinction between qualitative and quantitative scales, and, at the
same time, reduce some of the cumbersome calculations. (One of these
statistics is a regression coefficient developed from the residual
between observations and perfect patterns of responses to a given set
of items, which bears some conceptual resemblance to Sato's D*.)
However, the primary audience for these technical contributions
appears to have been educational statisticians and researchers.
Only infrequently was attention given to simplifying the techniques
for a broader potential audience (Green's (1955) contribution is one
exception, although published in a highly sophisticated journal).

Many of the publications by Sato and colleagues in Japan seem
geared directly to end-users, teachers in the classroom who, with the
S-P method and handscoring or microcomputer processing, can analyze
their own instructional data for purposes of understanding their
students' comprehension and modifying their own instruction. The
overarching concern of the Educational Measurement and Evaluation
Group at the Nippon Electric Company's Computer and Communication
Systems Research Laboratories has been development and dissemination
of readily understandable and adaptable procedures. Evidently the
method has proved popular in a variety of classroom settings in Japan,
and has been applied to the following areas:

- test scoring and feedback to each examinee about his/her own
performance on a test

- feedback to the instructor about both individual and group
performance

- analysis of types of errors made by students

- analysis of instructional process and hierarchies of
instructional units

- item analysis, rating scale analysis, questionnaire analysis

- test score simulations

- development of individual performance profiles across repeated
testings




Two characteristics are shared by all of these approaches:
first, the central focus of the study is the degree to which items
and/or respondents are heterogeneous, and second, the actual element
of raw data (say, 0 or 1) is assumed to be best understood in terms of
its position in a matrix with orderly properties. Interestingly, the
article by Green (1956) noted above forms the only overt link between
the S-P method and earlier work in English on analysis of response
patterns.

Where the S-P method diverges from its predecessors can be seen
in the very reduced role played by probability theory, and the
absence of anything resembling tests of statistical significance (a
shortcoming addressed below). Much of the work on the S-P method is
either in Japanese or in English-language journals not generally
available in the West. In the U.S. the number of research
presentations using the S-P method to date is small (Harnisch, 1980;
Harnisch & Linn, 1981, 1982; McArthur, 1982; Tatsuoka, 1978; Tatsuoka
& Tatsuoka, 1980).

Assumptions made by the model. The S-P method starts from a
complete matrix of scores, doubly reordered by I rows and J columns.
The model applies equally well to the trivial case of a 2 x 2 matrix,
and to 2 x J and I x 2 rectangular matrices; it also appears to have no
functional upper limit on the number of items or respondents.
However, missing data cannot be incorporated effectively. That is,
each respondent and item must have complete data since all
calculations are made with reference to I and J as constant values.
For purposes of reordering, if two or more respondents have the same
total score, their ranks are tied but their positions within the sorted
matrix must be unique, so ties between marginals are resolved
arbitrarily (a situation which could cause some small instability in
the S and P curves). In respect to both individual scores and sets of
scores taken as a whole, no explicit probabilistic formulation is
involved, although underlying the analysis of the matrix is a model
premised on cumulative binomial or beta binomial distributions, with
parameters I (number of cases), J (number of items), and p (average
passing rate). No study has been made of how guessing affects the
obtained pattern of responses, nor how corrections for guessing might
affect the S-P chart. Because of the very small number of assumptions
made by the model, its interpretation does not require a strong
theoretical background, and in fact can be annotated easily by
computer as an aid to the novice user. Indeed, the graphic
reordering with overlay of S- and P-curves but no further statistics
appears sufficient to allow teachers, with use of a brief nontechnical
reference guide, to make well-reasoned instructional decisions.

One implicit assumption deserves special attention. In the
derivation of a caution index for item or respondent, the entire
existing configuration of I respondents and J items, whether valid or
not, enters into consideration. That is, because the frame of
reference does not extend beyond the data at hand, the derivative
indices are inherently subject to limits on their analytic utility.
However, it is important to recognize that for the great bulk of
practical testing applications, such limitations in fact may be
advantageous. Each index also depends on a linear interpretation of
steps between marginal totals, although it is readily demonstrable
that substitution of a highly discriminating item for a weakly
discriminating one, or a very able examinee for a poor one, can alter
many of the indices for both persons and items. Additionally, the
linearity constraint treats all data elements within the matrix
equally, despite unknown (and perhaps inestimable) contributions from
chance correct responses. On the other hand, without further tests of
significance, the resulting statistical uncertainties, which are small
under most conditions, have little practical importance in the usual
classroom situation.

Strengths and weaknesses. Obvious strengths of the S-P system
are its simplicity, wide potential audience, and portability. The
code required for computer processing can be exceptionally brief and,
with the increased availability of microcomputers, can be delivered to
the classroom teacher directly. According to Harnisch and Linn
(1982), the caution indices compare well with Cliff's (1977) C_i1
and C_i2, Mokken's (1971) H_i, Tatsuoka and Tatsuoka's (1980)
Norm Conformity Index (NCI), and van der Flier's (1977) U', all of
which are harder to calculate as a rule. As an inherently flexible
system, it appears to be suitable for a variety of test types, and for
a range of analyses even within the same test. The novice user need
not master the full range of calculations in order to make excellent
use of more elementary portions of the results. A sophisticated user
can easily iterate selectively through an existing data set, choosing
particular items or persons not meeting some criterion for
performance, and recasting the remaining matrix into a revised chart.
Under certain conditions, addressed below, the method can be adapted
to examination of test bias (McArthur, 1982).

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Weaknesses include the following three general criticisms. No 
substantive body of psychometric or educational theory preceded the 
development of practical applications of the model because in fact its
development was not paradigm-driven. Instead, the S-P techniques
arose in response to a perceived need for classroom teachers to have a
readily interpretable, minimally complex tool for test analysis.
Thus, at present little can be said regarding questions of
reliability, validity, true scores, scaling theory, or quality of
measurement. No extant work addresses either the problem of
signal/noise ratio or of model fit. The absence of a strong
theoretical base dampens the development of rationally interconnected
research hypotheses, although the method offers ample opportunities
for direct investigation of individual performance and item
characteristics. The absence of strong theory-derived hypotheses
leaves a recognizable gap in the ability to draw strong inferences
from the S-P method. That is, in developing a diagnostic
interpretation of a student's score pattern, the teacher or researcher
must make a conscious effort to balance the evidence in light of some
uncertainty about what constitutes critical or significant departure
from the expected.

These weaknesses do not affect the classroom teacher to any major
degree. In the classroom, the technique is used for confirming
knowledge about individual students gained in the course of
interaction with the class, and/or to confirm that items on a
particular test are reasonably well suited to the class. From the
researcher's viewpoint, the weaknesses constitute rather important
blocks to further development. On the other hand, because of some
points of similarity between the S-P technique and less arcane aspects
of a number of existing models, hypothesis building tends to proceed
anyway. The absence of recognizable criteria for establishing
statistical significance for degree of heterogeneity is an important
technical problem. Because the various indices appear to share a
great deal in common with indices having known statistical properties
from other research models, an initial direction for such effort would
be to examine these parallels.

Present areas of application. All of the published studies in
English to date utilize the S-P method exclusively in the context of
right/wrong (1/0) scoring. These studies each use data collected from
multiple-choice tests (generally reading or math) administered to
primary or secondary level students. In this body of literature the
general application is either to the task of individual student
analysis or, more frequently, to item analysis. With an appropriate
microcomputer--one marketed exclusively in Japan is configured
exclusively for the purposes of the S-P method--classroom teachers can
use the technique interactively. Science teachers in Japan are
evidently the largest cluster of users, although details about
acceptance and daily utilization remain sketchy.

A different application arises in the context of large-scale
assessment. Harnisch (personal communication) reports that several
school districts have contracted for S-P analysis of mid-year and
final achievement test scores. Several thousand individuals tested on
dozens of items pose no new conceptual or mathematical complexity, and
in this situation the results can be used to address both item-level
and aggregate-level questions.

Strong parallels also can be found with aspects of the analysis of
planar Wiener processes and spatial patterns, from the domain of
mathematical geophysics.

Possible extensions of the model. Three new directions for the
S-P method are being explored. The first is the application of
iterative procedures, first suggested by Green (1955) in a brief
paragraph on p-tuple analysis of Guttman scales. Zimmer (1982) has
collected extensive developmental data on children's perception of
various tasks and attributions; this data incorporates multiple
discrete levels of performance arranged, according to theory, in a
logical staircase ascendency. P-tuple iterative analysis by the S-P
procedure appears to offer answers to three questions: a) does a broad
sample of children respond in an orderly manner to the range of tasks;
b) does such order reflect known characteristics of the sample (viz.
developmental level as measured on standardized procedures); and c) do
deviations from the symmetrical relationship between the developmental
complexity of the task and the developmental level of the child
reflect consistent support for one or another competing theory of
development? For this data, separate S-P analyses were made with the
first developmental level scored 0 and all others 1, then the first
two levels scored 0 and all others 1, and so on. Stability of person
order and item order, uniformity of the staircase intervals, and
relationships between item difficulty and item complexity can be
studied. Preliminary evidence suggests that the S-P method provides a
system of analysis for such multi-level data that exceeds the
explanatory power of several extant procedures.
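
The successive rescoring used in p-tuple analysis can be sketched as
follows. The sketch assumes that performance levels are coded as
integers 1, 2, ..., with 1 the first (lowest) developmental level; the
function names and the toy data are hypothetical and are not taken from
Zimmer's study.

```python
def dichotomize(levels, cut):
    """Score levels at or below `cut` as 0 and all higher levels as 1."""
    return [[1 if value > cut else 0 for value in row] for row in levels]


def p_tuple_passes(levels, n_levels):
    """Return one 0/1 matrix for each cut point 1 .. n_levels - 1."""
    return {cut: dichotomize(levels, cut) for cut in range(1, n_levels)}


# Hypothetical data: 3 children by 3 tasks, levels coded 1 to 3.
levels = [
    [1, 2, 3],
    [3, 3, 2],
    [2, 1, 1],
]
passes = p_tuple_passes(levels, n_levels=3)
for cut, matrix in passes.items():
    print(cut, matrix)
```

Each resulting 0/1 matrix can then be reordered and charted in the
usual way, and the person and item orders compared across cut points.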

In p-tuple analysis, which makes use of repeated passes through
data, some questions of a technical nature are unresolved at this
time. For example, it is clear that successive reorderings can
perturb the positional stability of any one respondent within the
matrix, or one task within the matrix, to some degree. However,
changes in ordering contribute to changes in the S-P indices, and
whether such changes, and/or linearity assumptions and violations
therein, play an important role is also under study in the context of
this developmental data. Another way to think of this problem is to
imagine a single matrix of persons x items with the S-P chart from
each developmental level overlaid. The most difficult tasks would be
accomplished only by the most developmentally advanced individuals,
and below a certain competence (i.e. the highest S-curve on this
compound chart) virtually no one would be expected to succeed on those
tasks. The ordering of those participants who fail at all tasks of
that difficulty level is arbitrary, because their total score for
these most difficult tasks is zero. But their ordering would not be
arbitrary on tasks of moderate or low difficulty, at which more
successes might be anticipated and the corresponding S-curves would be
located lower on the chart. What constitutes acceptable and
interpretable slippage of this kind needs further probing. Perhaps
the best analogy is to the term "seiche," drawn from the field of
oceanography: it refers to regular, entirely predictable tidal
motions occurring within confined bodies of water. Such seiche in a
polychotomous S-P chart ought to show itself totally consistent and
predictable.

The second area for development of the S-P method is in the realm
of scalar data, for which a number of statistical assumptions have
been developed. An example is signal detection analysis, in which the
"raw element" of data is once again a 0/1 response, this time for
absence or presence of perceived stimulus. A variety of complex
statistical techniques have been used to investigate how such stimuli,
presented across a range of intensities over a repeated number of
trials, are processed by the receiver. The analog in S-P analysis
might best be portrayed as a three-dimensional matrix of persons,
items, and repeated trials. Items are not necessarily objectively
identical from trial to trial, and responses are tempered by not one
but several possible orderly progressions. Such three-dimensional and
higher-dimensional data challenges the S-P method to provide cohesive
summary statistics which can be evaluated probabilistically.

An extension of the S-P technique to the study of test bias has
been made by McArthur (1982). Where two distinct groups have been
tested on the same instrument, or on two instruments one of which is an
exact translation of the other, S-P analysis offers an interesting
alternative to the complex techniques for detection of biased items
generally in use. McArthur studied the response patterns for items on
the California Test of Basic Skills, administered to both
English-speaking and Spanish-speaking children, the latter taking the
CTBS-Español. When proportions of children achieving correct
responses to a given item differ between the two language groups, the
item may not be biased. However, the D* values for the
student-problem matrices calculated separately for the two groups
suggest that the Spanish-language group engaged in more random
responding than did their English-speaking counterparts. A
significantly larger number of items for the former group show that
those children above the P-curve (children who in a case of "symmetry"
as defined earlier would be expected to do well) who gave the correct
response were frequently fewer in number than the corresponding sample
from the English-language group. That is, deleting cases below the
P-curve, which are more likely to have engaged in random responding,
leaves a finite number of respondents for whom the prediction of
success is high. Obviously, on easier items this reduced sample is
larger than for difficult items because of the shape of the P-curve.
Nonetheless, while the p values for a given item may differ
significantly between one group and the other, the proportions of
right answers above the P-curves can be statistically identical. To
establish evidence of bias, the additional requirement is that for
students in the disadvantaged group who by their pattern of
performance on the test as a whole should have succeeded with a
particular item, that item generated erroneous responding for one
group more than for another.

THE RASCH MODEL FOR ITEM ANALYSIS

Bruce Choppin
Center for the Study of Evaluation, UCLA



1. Definition of the Model

The so-called Rasch model, now widely employed for item analyses,
is only one of a complete family of models described by Rasch in his
1960 text. All may be properly called "Rasch models" since they share
a common feature which Rasch labeled "specific objectivity". This is
a property of most measurement systems which requires that the
comparison of any two objects that have been measured shall not depend
upon which measuring instrument or instruments were used. It is a
familiar feature of many everyday physical measurements (length, time,
weight, etc.). In the context of mental testing, it means that the
comparison of two individuals who have been tested should be
independent of which items were included in the tests. Traditional
test analysis based on "true scores" does not have this property since
"scores" on one test cannot be directly compared to "scores" on
another. (The peculiar virtues of specific objectivity and the
conditions needed to achieve it are discussed later in this chapter.)

Mathematical Representation

The Rasch model is a mathematical formulation linking the
probability of the outcome when a single person attempts a single item
to the characteristics of the person and the item. It is thus one of
the family of latent-trait models for the measurement of achievement,
and is arguably the least complex member of this family. In its
simplest form it can be written

(1)  $$ \text{Probability}\,[X_{vi} = 1] = \frac{A_v}{A_v + D_i} $$

where $X_{vi}$ takes the value 1 if person v responds correctly
to item i, and zero otherwise,

$A_v$ is a parameter describing the ability of person v,
and $D_i$ is a parameter describing the difficulty of item i.

In this formulation, A and D may vary from 0 to $\infty$. A
transformation of these parameters is usually introduced to simplify
much of the mathematical analysis. This defines new parameters for
person ability ($\alpha$) and item difficulty ($\delta$) to satisfy the equations:

$$ A_v = W^{\alpha_v} \quad \text{and} \quad D_i = W^{\delta_i} \quad \text{for some constant } W. $$

Figure 1: Item Characteristic Curve (wits) for the Rasch Model

A further simplification, introduced by Rasch himself and used
widely in the literature, is to fix the constant W to the natural
logarithmic base, e. In this case the model can be written:

(2)  $$ \text{Probability}\,[X_{vi} = 1] = \frac{e^{t}}{1 + e^{t}}, \quad \text{where } t = (\alpha - \delta). $$

In this formulation, $\alpha$ and $\delta$ can take all real values and measure
ability and difficulty respectively on the same "logit" scale.
The sign of the expression ($\alpha - \delta$) in any particular instance
indicates the probable outcome of the person-item interaction. If
$\alpha > \delta$ then the most probable outcome is a correct response. If
$\alpha < \delta$ then the most likely outcome is an incorrect response.
It should also be noted that the "odds" for getting a correct
response (defined as the ratio of the probability for getting one
to the probability for not getting one) take on a particularly
simple form:

$$ \text{Odds}\,[X_{vi} = 1] = \frac{e^{t} / (1 + e^{t})}{1 / (1 + e^{t})} = e^{t} $$

or $t = \log_e(\text{odds})$.
For this reason, the Rasch model is sometimes referred to as the
"log-odds" model.

Alternative Units

As stated above, the model based on the exponential function
yields measures of people and items on a natural scale, whose unit
is called a "logit". Rasch himself used the model in this form,
and most of Wright's publications also make use of it. Mathematically
and computationally the logit is convenient, but as an operational
unit it has two drawbacks. First, a change in achievement of one
logit represents a considerable amount of learning. Studies in
various parts of the world indicate that in a given subject area, the
typical child's achievement level would rise by rather less than half
a logit in a typical school year. In practice, many of the
differences in achievement level that we need to measure are much less
than this, as is the precision yielded by our tests, so results are
commonly expressed as decimal fractions rather than as integers.
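
To give a feel for the size of a logit, the following minimal sketch evaluates equation (2) for a pupil whose ability starts level with the difficulty of an item and then rises by half a logit and by one logit. The function names `rasch_probability` and `log_odds` are illustrative choices, not taken from any program cited in this paper.

```python
import math

def rasch_probability(ability, difficulty):
    """Equation (2): probability of a correct response, both values in logits."""
    t = ability - difficulty
    return 1.0 / (1.0 + math.exp(-t))  # algebraically the same as e**t / (1 + e**t)

def log_odds(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# A pupil level with the item, then after gains of 0.5 and 1.0 logits.
for gain in (0.0, 0.5, 1.0):
    p = rasch_probability(gain, 0.0)
    print(f"gain {gain:+.1f} logits: P(correct) = {p:.2f}, log-odds = {log_odds(p):.2f}")
```

On an item that was originally a fifty-fifty proposition, a gain of half a logit raises the probability of success to about 0.62 and a gain of one logit to about 0.73; the log-odds column simply returns the gain, as the "log-odds" name suggests.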

Secondly, logits are usually ranged around a mean of zero (this
is a matter of convention rather than necessity) so that half of all
the values obtained for parameters are typically negative. In
general, teachers dislike dealing with negative numbers, and the
prospect of having to explain to an anxious parent what Jimmy's change
in math achievement from -1.83 logits to -1.15 logits actually means
is too much for most of them.

The solution for practical applications of the Rasch scaling
technique is to use a smaller and more convenient unit. This is
accomplished by setting W to some value other than e. A number of
alternatives have been suggested, but the unit in the widest use after
the logit is obtained by setting $W = 3^{0.2}$. This unit is known as the
"wit" in the United Kingdom and United States, and as the "bryte" in
Australia. Wits are typically centered around 50 with a range from
about 30 to 70. One logit is equal to 4.55 wits. For many purposes
it is sufficient to report wits as integers. The particular value for W
is chosen so as to provide a set of easily memorized probability values,
as can be seen in Table 1.



Table 1

The Relationship of Logits and Wits to the
Probability of Correct Response

($\alpha - \delta$) Measured    ($\alpha - \delta$) Measured    Probability of a
in Logits                       in Wits                         Correct Response

   -2.198                          -10                             0.10
   -1.099                           -5                             0.25
    0                                0                             0.50
   +1.099                           +5                             0.75
   +2.198                          +10                             0.90



It must be emphasized that the choice of a unit for reporting is an
arbitrary matter. Most of the theoretical work on the model, and all
the computer programs for parameter estimation in common use, work in
logits, translating to wits or some other scale for reporting only if
desired.
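
The translation from logits to wits for reporting can be sketched as follows. This is an illustration only: it takes $W = 3^{0.2}$ from the text, and it assumes, for the parent-teacher example, that the reporting scale is centered on 50 wits. The names `logits_to_wits` and `probability_from_wits` are hypothetical.

```python
import math

W = 3 ** 0.2                          # base that defines the wit
WITS_PER_LOGIT = 1.0 / math.log(W)    # about 4.55

def logits_to_wits(logits, center=0.0):
    """Rescale a logit value to wits; `center` is the chosen reporting origin."""
    return center + logits * WITS_PER_LOGIT

def probability_from_wits(difference):
    """P(correct) when ability exceeds difficulty by `difference` wits."""
    return W ** difference / (1.0 + W ** difference)

# The probability column of Table 1.
for d in (-10, -5, 0, 5, 10):
    print(f"{d:+3d} wits -> P(correct) = {probability_from_wits(d):.2f}")

# The change from -1.83 to -1.15 logits, assuming a scale centered on 50 wits.
before = round(logits_to_wits(-1.83, center=50))
after = round(logits_to_wits(-1.15, center=50))
print(before, after)
```

Under that centering assumption the change would be reported as a rise from 42 to 45 wits, which avoids both the negative signs and the decimal fractions.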

Analytic Possibilities 

Parameter estimation is a difficult issue in latent-trait
theories. That a variety of different estimation algorithms (at least
six) have become available for the Rasch model in the last fifteen
years results from the mathematical simplicity of the Rasch formulation.
The basic equation models only the outcome of one particular
item-person interaction, but since it does so in terms of a
probability function, it is necessary to accumulate data from several
such interactions in order to estimate parameters or test the fit of
the model itself.

For example, the accumulation of responses of one individual to a 
set of items may be used to estimate the ability parameter for the 
individual, and the pattern of responses by several individuals to two 
items may be used to estimate the relative difficulty of the two 
items. From a (persons-by-items) response matrix it is possible to
estimate both sets of parameters (abilities and difficulties), and
also to check on whether the model is an acceptable generating
function for the data. This calibration of items, and the test of 
goodness-of-fit to the model, correspond to item analysis procedures 
in classical test theory (but see section 5(a)). 

Once items have been calibrated, equations can be developed to
predict the characteristics of tests composed of different samples of
previously calibrated items, or the performance of previously measured
people on new items. Although the simplest approach to statistical
analysis requires a complete rectangular persons-by-items response
matrix, other procedures are available to handle alternative data
structures. For example, when a group of individuals take different
but overlapping tests, the persons-by-items matrix will necessarily be
incomplete, but it is still possible to calibrate the items and
measure the people. An extreme example, in which a computer-managed
adaptive test is individually tailored to each testee (such that the
next item given depends on the responses to previous items), may lead
to a situation in which every person tested may respond to a unique
set of items. If the items have been calibrated in advance, it is
possible to estimate the individual's ability parameter at each step
of the sequence, and to discontinue testing when the ability has been
measured with the desired degree of precision.
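
The adaptive procedure just described might be sketched as below. Several choices here are assumptions made for illustration and are not taken from the text: the next item is the unused one whose difficulty is closest to the current ability estimate, the ability is a maximum likelihood estimate found by bisection and clamped to the range -6 to +6 logits (all-right or all-wrong records have no finite estimate), precision is judged by the usual large-sample standard error 1 / sqrt(sum of p(1 - p)), and the item bank and the simulated examinee are hypothetical.

```python
import math
import random

def p_correct(ability, difficulty):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_ability(responses, difficulties, lo=-6.0, hi=6.0):
    """Bisection on: expected raw score = observed raw score (clamped to [lo, hi])."""
    raw_score = sum(responses)
    for _ in range(60):
        mid = (lo + hi) / 2.0
        expected = sum(p_correct(mid, d) for d in difficulties)
        if expected < raw_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def standard_error(ability, difficulties):
    information = sum(p_correct(ability, d) * (1.0 - p_correct(ability, d))
                      for d in difficulties)
    return 1.0 / math.sqrt(information)

def adaptive_test(bank, answer, target_se=0.5):
    """bank: item id -> calibrated difficulty in logits; answer: item id -> 0 or 1."""
    remaining = dict(bank)
    used, responses = [], []
    ability, se = 0.0, float("inf")
    while remaining and se > target_se:
        # Assumed selection rule: the unused item nearest the current estimate.
        item = min(remaining, key=lambda i: abs(remaining[i] - ability))
        used.append(remaining.pop(item))
        responses.append(answer(item))
        ability = estimate_ability(responses, used)
        se = standard_error(ability, used)
    return ability, se, len(used)

# Hypothetical bank of 41 items from -3 to +3 logits and a simulated examinee.
bank = {f"item{n:02d}": -3.0 + 0.15 * n for n in range(41)}
rng = random.Random(1)
true_ability = 1.0
answer = lambda item: int(rng.random() < p_correct(true_ability, bank[item]))
ability, se, n_used = adaptive_test(bank, answer)
print(round(ability, 2), round(se, 2), n_used)
```

Because no single item can contribute more than 0.25 to the information sum, a target standard error of 0.5 logits cannot be reached with fewer than 16 items under this stopping rule.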

Estimation Techniques

Although this paper is not the place for a detailed presentation
of the algebraic manipulation involved in the various algorithms for
parameter estimation, an outline of the different approaches may be
helpful.

Conventionally the starting point is taken to be a rectangular
matrix of persons by items in which the elements are one if a
particular person responded correctly to the appropriate item, zero if
he responded incorrectly, and blank if the person was not presented
with the item. Initially we shall restrict the discussion to complete
matrices of ones and zeros such as occur when a group of N people all
attempt a test of k items. In most applications N is usually much
larger than k. Two summarizations of data contained in the N x k
matrix lead to effective strategies for parameter estimation (see
Figure 2).

One, known as the "score-group method", clusters together all
those persons who had a particular raw score, and then counts within
each cluster the number of correct responses to each item. This
produces a score-group by item matrix as in Figure 2A. The other
method considers the items two at a time, and counts for each pair the
number of persons who responded correctly to the first but incorrectly
to the second. This is known as the "pair-wise" approach and produces
an item by item matrix as in Figure 2B. (A parallel analysis
comparing the people two at a time can be developed theoretically, but
has found little practical application.) Both the score-group and the
pair-wise approaches are described by Rasch in his 1960 book, but
without the development of a maximum likelihood technique he was
unable to exploit them.

Figure 2: Data reduction strategies for Rasch parameter estimation.

The score-group method produces a (k + 1) by k matrix, but since
raw scores of zero and k do not contribute to the estimation
procedure, the summary yields k(k - 1) elements for use in the
estimation algorithm. The pair-wise approach results in a k by k
matrix in which the leading diagonal elements are always zero, so
again there are k(k - 1) elements in the summary on which the
estimation algorithm operates.
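
The two data reductions can be illustrated on a small complete response matrix. The sketch below is not drawn from any of the programs cited; the function names and the six-person, three-item data set are invented for the purpose.

```python
def score_group_matrix(responses):
    """(k + 1) x k table: row s, column i counts the persons with raw
    score s who answered item i correctly."""
    k = len(responses[0])
    table = [[0] * k for _ in range(k + 1)]
    for row in responses:
        s = sum(row)
        for i, x in enumerate(row):
            table[s][i] += x
    return table

def pairwise_matrix(responses):
    """k x k table: entry [i][j] counts the persons who answered item i
    correctly and item j incorrectly."""
    k = len(responses[0])
    table = [[0] * k for _ in range(k)]
    for row in responses:
        for i in range(k):
            for j in range(k):
                if row[i] == 1 and row[j] == 0:
                    table[i][j] += 1
    return table

# Hypothetical data: six persons (rows) by three items (columns).
data = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
]
sg = score_group_matrix(data)
pw = pairwise_matrix(data)
print(sg)   # rows for raw scores 0 and k carry no information about the items
print(pw)   # leading diagonal is always zero
```

With k = 3, the rows of the score-group table for raw scores 1 and 2 hold 2 x 3 = 6 usable counts, and the off-diagonal cells of the pair-wise table likewise number 3 x 2 = 6, in line with the k(k - 1) figure given above.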

Analysis of the score-group matrix to separate information on $\alpha$ and
$\delta$ and thus obtain fully conditioned estimates for both the item
difficulty parameters and the abilities associated with membership of
score-groups 1 through k - 1 is computationally demanding and
expensive. The best available procedure has been programmed by
Gustafsson (1977), but, though mathematically elegant and
statistically sound, it is far too expensive for routine use.
However, Wright has shown that estimates based on the margins of
the score-group matrix can be obtained very easily using a maximum
likelihood approach. Though the simultaneous estimation of both the $\alpha$
and $\delta$ sets of parameters introduces a bias, a simple expansion
factor applied to the results can largely correct for this (Wright &
Douglas, 1977; Habermann, 1977), and this method is widely used in
practice. When the data are summarized in a score-group fashion, they
are convenient for checking the assumption of equal discriminating
power between items, and the tests of fit developed by Wright and Mead
(1976) concentrate on this.

By contrast, the pair-wise approach separates information about
the $\delta$'s from information about the $\alpha$'s at the beginning. The matrix
of counts summarized in Figure 2B has conditioned out all information
about variations in $\alpha$, so that a fully conditional estimate of the
$\delta$'s (either by maximum likelihood or least squares) can be
obtained. The ability estimates for each individual are developed
from solving iteratively the equation:

$$ r = \sum_{i=1}^{k} \frac{W^{(\alpha - \delta_i)}}{1 + W^{(\alpha - \delta_i)}} $$

where r is the raw score of the person, and the summation extends only
over those items that were attempted.

The test of fit applied to the pair-wise summary matrix is not
very sensitive to violations of the equal discrimination power
assumption (see section 3), but instead focuses on the issue of local
independence between items (Choppin & Wright, in progress). In
practice, therefore, the two approaches may be regarded as
complementary.
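
A minimal sketch of the pair-wise route follows. It assumes a simple unweighted least-squares solution in which, under the model, the ratio of the count for "item i right, item j wrong" to the count for "item j right, item i wrong" estimates $e^{\delta_j - \delta_i}$; the difficulties are centered on zero and every off-diagonal count is assumed to be positive. The ability equation is solved by bisection, which is only one of several possible iterative schemes. The function names and the artificial counts are illustrative.

```python
import math

def pairwise_difficulties(b):
    """Least-squares difficulties (logits, centered on zero) from counts
    b[i][j] = persons with item i right and item j wrong.
    log(b[j][i] / b[i][j]) estimates delta_i - delta_j."""
    k = len(b)
    return [sum(math.log(b[j][i] / b[i][j]) for j in range(k) if j != i) / k
            for i in range(k)]

def ability_from_raw_score(r, difficulties, W=math.e):
    """Solve r = sum_i W**(a - d_i) / (1 + W**(a - d_i)) for a by bisection."""
    if not 0 < r < len(difficulties):
        raise ValueError("no finite estimate for a zero or perfect score")
    span = 30.0 / math.log(W)
    lo, hi = min(difficulties) - span, max(difficulties) + span
    for _ in range(100):
        mid = (lo + hi) / 2.0
        expected = sum(1.0 / (1.0 + W ** (d - mid)) for d in difficulties)
        if expected < r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Artificial counts that agree exactly with the model for known difficulties.
true_delta = [-1.0, 0.0, 1.0]
k = len(true_delta)
counts = [[0.0 if i == j else
           1000.0 * math.exp(true_delta[j])
           / (math.exp(true_delta[i]) + math.exp(true_delta[j]))
           for j in range(k)] for i in range(k)]

delta_hat = pairwise_difficulties(counts)
alpha_hat = ability_from_raw_score(2, delta_hat)
print([round(d, 3) for d in delta_hat], round(alpha_hat, 3))
```

Because the summation in `ability_from_raw_score` runs only over the difficulties supplied, the same function serves a person who attempted a subset of the items: one passes the difficulties of the attempted items and the raw score earned on them.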

Though slower than the Wright estimation algorithm based on
score-group marginals, the pair-wise approach has the considerable
advantage of being able to handle incomplete data matrices,
corresponding to all those applications in which not every person
attempts every item. It is thus of particular interest in such fields
as adaptive testing and item banking (Choppin, 1978, 1982).



2. The Measurement Philosophy and Primary Focus of Interest

Although it turns out that the mathematical details have much in
common with those of "item response theory", Rasch derived his models
from a very different standpoint. In the first paragraph of the
preface to the book which launched his ideas on measurement (Rasch,
1960) he quotes approvingly an attack by B.F. Skinner on the
application of conventional statistical procedures to psychological
research.

"The order to be found in human and animal behavior
should be extracted from investigations into
individuals ... psychometric methods are inadequate for
such purposes since they deal with groups of
individuals." (Skinner, 1956, p. 221)

Group-centered statistics, which form the backbone of
conventional psychometric practice (factor analysis, analysis of
variance, etc.) require the clustering of individuals into discrete
categories or populations, and further make assumptions about the
nature of variation within these categories which Rasch viewed with
grave distaste. The alternative was to develop methods which would
work with individuals.

"Individual-centered statistical techniques require
models in which each individual is characterized
separately and from which, given adequate data, the
individual parameters can be estimated. It is further
essential that comparisons between individuals become
independent of which particular instruments - tests, or
items or other stimuli - within the class considered
have been used. Symmetrically, it ought to be possible
to compare stimuli belonging to the same class -
measuring the same thing - independent of which
particular individuals within the class considered were
instrumental for the comparison." (Rasch, 1960, p. vii)

In this excursion into what he later called "specific
objectivity", Rasch is echoing a theme developed explicitly by
L.L. Thurstone three decades earlier:

"A measuring instrument must not be seriously affected
in its measuring function by the object of
measurement. To the extent that its measurement
function is so affected, the validity of the instrument
is impaired or limited. If a yardstick measured
differently because of the fact that it was a rug, a
picture, or a piece of paper that was being measured,
then to that extent the trustworthiness of that
yardstick as a measuring device would be impaired.
Within the range of objects for which the measuring
instrument is intended its function must be independent
of the object of measurement." (Thurstone, 1928,
p. 547).

Reliance on this form of analogy to the physical sciences is
quite characteristic of latent trait measurement theorists. Wright
(1968, 1977) also uses the yardstick as a convenient metaphor for a
test item. Others (Eysenck, 1979; Choppin, 1979, 1982) have pointed
out the similarities between the measurement of mental traits and the
measurement of temperature. The underlying premise is that although
psychological measurement may be rather more difficult to accomplish
than is measurement in the fields of physics and chemistry, the same
general principles should apply. Features which are characteristic of
good measurement techniques in physics should also be found in the
fields of psychology and education.

Rasch himself draws out the similarity between the development of
his model and Maxwell's analysis of Newton's laws of motion in terms
of the concepts force and mass (Maxwell, 1876). The second law links
force, mass and acceleration in a situation where although
acceleration and its measurement have been fully discussed, the
concepts mass and force are not yet defined. Rasch (1960, pp.
110-114) considers the necessity of defining the two concepts in terms
of each other, and shows how appropriate manipulation of the
mathematical model (the "law") and the collection of suitable data can
lead to the (comparative) measurement of masses, and the (comparative)
measurement of forces. He points out the close analogy to his
item-response model which links ability, difficulty and probability.
Ability and difficulty require related definitions since people need
tasks on which to demonstrate their ability, and tasks only exhibit
their difficulty when attempted by people. Since his model is
"specifically objective", data can be collected so that the two sets
of parameters are capable of separate estimation (as with force and
mass).

This approach to measurement is the primary focus of interest for 
the Rasch model. Individuals are to be measured through the
estimation of parameters characterizing their performance. These 
parameters shall be interpretable by comparison with the parameters 
estimated for other individuals (as in norm-referencing) and/or in 
conjunction with the parameter estimates for test stimuli (as in 
criterion-referencing). 

3. Assumptions made by the Rasch Model 

The basic assumption is a simple yet powerful one that derives
from the requirement of specific objectivity, so central to Rasch's
thinking about measurement. It is that the set of people to be
measured, and the set of tasks (items) used to measure them, can each
be uniquely ordered in terms respectively of their ability and
difficulty. (Ability and difficulty as already described.) This
ordering permits a parameterization of people and tasks that fits the
simple model defined in section 1 above.

The basic assumption has a number of important implications. One
such implication is that of local independence. The probability of a
particular individual responding correctly to a particular item must
not depend upon the responses that have been made to the previous
items. If it did, then altering the sequence of items that make up a
particular test would alter the ordering of people on the underlying
trait (in violation of the basic assumption). Similarly, local
independence requires that the response of an individual to a
particular item is not affected by the responses given by other people
to the same item. If it were, then it would be possible, by selective
clustering of people, to change the ordering of items in terms of
their difficulty (in violation of the basic assumption).

Another implication that follows from the basic assumption of the
model is sometimes stated (rather confusingly) as "equality of
discrimination". It must be emphasized that this does not mean that
all items are assumed to have equal point-biserial correlation indices
with total test score, or with some external criterion. Rather, it
means that the signal/noise ratio represented by the maximum slope of
the characteristic curve of each item is assumed to be the same for
all items. If the slopes were not the same, then at some point the
characteristic curves for two items would cross. This would mean that
the ordering of the items in terms of difficulty for persons of lower
ability would not be the same as the ordering for persons of higher
ability (see Figure 3). This again violates the basic assumption.
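
The crossing argument can be checked numerically. The sketch below adds a slope multiplier to the logistic curve purely for illustration; such a parameter is not part of the Rasch model, and the two items and their values are hypothetical.

```python
import math

def icc(ability, difficulty, slope=1.0):
    """Logistic item characteristic curve. slope=1 gives the Rasch curve;
    any other slope is an illustrative departure from the Rasch model."""
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))

# Equal slopes: the item with the lower difficulty is easier at every ability.
same_order = all(icc(a / 2, 0.0) > icc(a / 2, 0.5) for a in range(-8, 9))

# Unequal slopes (hypothetical items): the curves cross near ability -0.17.
item_a = dict(difficulty=0.0, slope=2.0)
item_b = dict(difficulty=0.5, slope=0.5)
for ability in (-2.0, 2.0):
    print(ability, round(icc(ability, **item_a), 3), round(icc(ability, **item_b), 3))
```

For a person at -2 the second item is the easier of the pair, while for a person at +2 the first item is easier, so no single ordering of the two items by difficulty holds for everyone.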

Figure 3

(a) Characteristic curves for items
that fit the Rasch Model.

(b) Characteristic curves for two items
with different discriminations.

Uni-dimensionality is also a consequence of the basic
assumption. If the performance of people on a set of items depended
on their individual standing on two or more latent traits, such that
the ordering of people on these latent traits was not identical, then
it would be impossible to represent the interaction of person and
task with a single person parameter for ability.

A further assumption, and one which is mathematically very
convenient, albeit somewhat unrealistic (at least on multiple-choice
items), is that there is no random guessing behavior. The model
requires that for any test item, the probability of a successful
response tends asymptotically to zero as the ability of the person
attempting it is reduced (see Figure 1).

Similarly, there is a built-in assumption, which has been much
less carefully explored, that as the ability of the person being
considered increases, the probability of a successful response to any
given item approaches one.

4. Strengths and Weaknesses and Gaps in the Development

The strong features of the Rasch model when compared with other
measurement models are:



(a) The combination of specific objectivity, a
    property taken for granted in the field of
    physical measurement, and the model's mathematical
    simplicity.

(b) Deriving from this, the separability property
    which permits the estimation of person-parameters
    and item-parameters separately.

(c) The existence of several algorithms for parameter
    estimation, some of which are extremely fast and
    which work well with small amounts of data.

(d) The inbuilt flexibility of the system. As with
    other latent-trait models which are defined at the
    item level, there is no requirement that tests be
    of a fixed length or contain the same items.

(e) The close parallels that exist between the Rasch
    model and the conventional practice of calculating
    raw scores based on an equal weighting of items.
    Rasch models are the only latent-trait models for
    which the raw score, as conventionally defined, is
    a sufficient statistic for ability (and
    correspondingly the raw difficulty or p-value of
    an item is a sufficient statistic for Rasch
    difficulty).
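
The sufficiency property in point (e) can be illustrated numerically: for two response patterns with the same raw score, the ratio of their probabilities under the model does not depend on the person's ability, so the particular pattern adds nothing about ability beyond the score itself. The item difficulties and patterns below are hypothetical.

```python
import math

def pattern_probability(pattern, ability, difficulties):
    """Probability of a whole response pattern under the Rasch model,
    assuming local independence between items."""
    prob = 1.0
    for x, d in zip(pattern, difficulties):
        p = 1.0 / (1.0 + math.exp(-(ability - d)))
        prob *= p if x == 1 else (1.0 - p)
    return prob

difficulties = [-1.0, 0.0, 1.5]   # hypothetical calibrations, in logits
pattern_1 = [1, 1, 0]             # raw score 2
pattern_2 = [0, 1, 1]             # raw score 2

ratios = [pattern_probability(pattern_1, a, difficulties)
          / pattern_probability(pattern_2, a, difficulties)
          for a in (-2.0, 0.0, 3.0)]
print([round(r, 4) for r in ratios])
```

At every ability tried the ratio equals $e^{2.5}$, the exponential of the difference in difficulty between the third and first items; ability cancels out entirely.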

Against this it must be admitted that there are areas of
considerable weakness. The most serious focuses on the assumptions
made by the model. These are, in general, too strong to carry full
credibility. In practice some real data appear to fit the model
rather poorly. The assumptions of local independence and of no
guessing (which are crucial to the model) are not strictly met in
practice. Although the psychometrician may be able to reduce the
guessing problem through the avoidance of objective items, and may be
able to structure the test and the conditions under which it is
administered to improve local independence, in real life situations
these problems are rarely completely eliminated. The model also
demands (as do most others) uni-dimensionality (or, as Rasch calls it,
conformability), and while the items that comprise many existing tests
fail to meet this criterion, the problem is less critical. If one has
control over the test construction phase of a measurement program,
then it is possible to build sets of items which satisfy the
uni-dimensionality assumption moderately well.

One feature of the model which has been described as a weakness
(Goldstein, 1979; Divgi, 1981) is that it implies a unique ordering of
items, in terms of their difficulty, for all individuals. This
appears not to be sufficiently sensitive to the effects of
instructional and curriculum variation, and stands, therefore, as an
important criticism (but see Bryce, 1981).

The seriousness with which such objections need to be considered
depends upon the nature of the measurement task being addressed. Most
educational instruction programs aim at increasing the learning of the
student and thus at increasing his ability to solve relevant test
items. We would usually expect the ability to solve all relevant test
items to increase, but the relative difficulty of the items could (and
normally would) remain unchanged. While this is the dominant goal of
instruction, the model can handle the situation appropriately, and the
occasional changes in relative difficulty brought about by alternative
curricula (see, for example, Engel, 1976 or Choppin, 1978) can shed
considerable light on the real effects of the instructional program.
If, however, a section of curriculum is aimed specifically at breaking
down some piece of learning and replacing it with another (i.e. making
some items more difficult to solve, and others easier) such as may
occur during revolutionary changes in society, then we may well feel
that the simple model proposed is inadequate to describe the
situation. In this case the items measuring the "old" learning and
the "new" do not seem to belong on the same scale. Such
circumstances, however, are not routine in the United States.

Similarly, we find in general that the ordering of item
difficulties is the same with respect to all students. Where one
student differs significantly in finding some item much harder or
easier than predicted by the model, then we have valuable diagnostic
information about that individual (Mead, 1975). In practice we rarely
find evidence for such differences, and where they do occur the
interpretation is usually clear and direct (for example, the student
missed instruction on a particular topic). If we were attempting to
measure in an area where there was no common ordering of item
difficulties for most students, then the model would appear quite
inappropriate. Such situations may be simulated by creating test
items whose solution depends upon luck or chance, but this is far
removed from purposive educational testing.

Experience over the last two decades suggests that the
simplification made by the model in requiring a unique ordering of
items is met adequately in practice. Deviations, where they do occur,
are indicators of the need for further investigation (Dobby &
Duckworth, 1979; Choppin, 1977). There seems little reason,
therefore, to regard this as a weakness of the Rasch approach.

5. Areas of Application

The basic form of the model proposed by Rasch, and described in
section 1, dealt with the simplified situation where only two possible
outcomes of a person attempting a test item were considered (i.e. the
response is scored "right" or "wrong"). For this reason, perhaps,
most of the applications so far developed have been confined to the
use of "objective" test items for the measurement of achievement since
these are most naturally scored in this fashion.

(a) Item Analysis

The most frequent application of the model has been for item
analysis. Users have wanted to confirm that the model fits data they
have already accumulated for existing tests; they seek clues as to why
particular tests are not functioning as well as they should; or in the
construction of new tests they seek guidance as to which items to
include and which to omit.

It is probably true to say, however, that the Rasch model has not
proved particularly valuable in any of these three roles. It can
detect lack of homogeneity among items, but is probably less sensitive
to this than is factor analysis. It can identify items that do not
discriminate or for which perhaps the wrong score key has been
selected, but it seems no more effective at this than is the more
traditional form of item analysis. The exception to this
generalization probably comes when tests are being tailored for a very
specific purpose. Wright and Stone explore this in "Best Test Design"
(1979). Careful adherence to all the steps they outline would
probably yield a test with better characteristics for the specific
and intended purpose than would a test produced on the basis of only
traditional forms of item analysis and the crude criteria they employ.

(b) Scaling and Equating

A serious problem of traditional testing is that the "score"
produced can only be interpreted in terms of the particular test
used. The development of norms for standardized tests is an attempt
to overcome this problem but this too has serious limitations. Latent
trait scaling has been used to tackle this question directly. With
the Rasch model, the raw scores on one test are mapped onto their
latent trait scale, and different tests can of course have their
scores mapped onto the same scale (provided always that the dimension
of ability being measured is the same). The method has been used to
compare "quasi-parallel" tests (e.g., Woodcock, 1973; Willmott &
Fowles, 1974); to link the tests given at different stages of a
longitudinal study (Engel, 1976; Choppin, 1978); and to check on the
standardization characteristics of batteries of published tests (Rentz
& Bashaw, 1975, 1977).
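
The mapping of raw scores from two tests onto one latent scale can be sketched as follows, assuming both sets of items have already been calibrated on a common logit scale. The two tests and their difficulties are hypothetical, and bisection is used only as a convenient way to invert the expected-score function.

```python
import math

def expected_score(ability, difficulties):
    """Expected raw score on a test at a given ability (logits)."""
    return sum(1.0 / (1.0 + math.exp(d - ability)) for d in difficulties)

def ability_for_score(r, difficulties):
    """Ability at which the expected raw score equals r (0 < r < test length)."""
    lo, hi = min(difficulties) - 30.0, max(difficulties) + 30.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical calibrations on a common scale: test B is longer and harder.
test_a = [-1.5, -0.5, 0.0, 0.5, 1.5]
test_b = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]

table = []
for r in range(1, len(test_a)):
    a = ability_for_score(r, test_a)
    table.append((r, a, expected_score(a, test_b)))
    print(f"score {r} on A -> ability {a:+.2f} logits -> expected {table[-1][2]:.2f} on B")
```

The two tests differ in length and in the spread of their item difficulties, yet each raw score on test A corresponds to a position on the common scale and hence to an expected score on test B.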

It should perhaps be noted that although equating using the Rasch
model appears more flexible than traditional procedures in that only
the difficulty level of the two tests is being compared and other
characteristics such as test length, the distribution of item
difficulties, etc. may be quite different, there is an implicit
assumption that the "discrimination power" (in the sense discussed
above) of the items in the two tests is comparable. As a rule this
implies that the item types are similar. Attempts to use the Rasch
model to equate multiple choice and essay type tests on the same topic
have led to inconsistent and bizarre results (Willmott, 1979; Vincent,
1980).

(c) Item Banking

Item banks take the equating of test scores to its logical limit
by calibrating all possible performances on all possible tests
composed of items drawn from a fixed set (the bank).

When a family of test items is constructed so that they
can be calibrated along a single common dimension and
when they are employed so that they retain these
calibrations over a useful realm of application, then a
scientific tool of great simplicity and far reaching
potential becomes available. The "bank" of calibrated
items can serve the composition of a wide variety of
measuring tests. The tests can be short or long, easy
or hard, wide in scope or sharp in focus. (Wright,
1980).

An item bank requires calibration, and although in theory there
are alternative approaches, in practice the Rasch model has proved by
far the most cost effective and is the most widely used (Choppin,
1979).

(d) Quality of Measurement

An important development that is facilitated by latent trait
scaling is the calculation of an index to indicate the quality of
measurement for each set of test data, and if necessary for each
person attempting a test or for each item. The Rasch model, for
example, yields an explicit probability for each possible outcome of
every interaction of a person and an item. Where, overall, the
probabilities of the observed outcomes are too low we may deduce that
for some reason the Rasch model does not offer an adequate description
of a particular set of data. If the probabilities are generally in
the acceptable range, but are low for a particular item, then we may
conclude that this is an unsatisfactory item. Perhaps it does not
discriminate, or is addressing some different dimension of
achievement. If the probabilities are generally acceptable but are
low for a specific person, then we may conclude that this person was
not adequately measured by the test (perhaps he guessed at random, was
insufficiently motivated, or misunderstood the use of the answer
sheet). The reporting for this person of a low measurement quality
index would imply that the person's score should be disregarded and
that a retest is appropriate.
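
One simple way to turn these probabilities into an index is sketched below. This is an illustration only, not the index used by any particular program: it assumes the logistic form of the Rasch model, takes the person's ability and the item difficulties as known, and uses the mean log-probability of the observed responses as a hypothetical quality measure.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of success for ability theta on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def mean_log_prob(responses, theta, difficulties):
    """Average log-probability of a person's observed 0/1 responses."""
    total = 0.0
    for u, b in zip(responses, difficulties):
        p = rasch_p(theta, b)
        total += math.log(p if u == 1 else 1.0 - p)
    return total / len(responses)

difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical calibrations
theta = 0.0                                  # ability taken as given for the sketch

consistent = [1, 1, 1, 0, 0]   # succeeds on the easy items, fails the hard ones
erratic = [0, 0, 1, 1, 1]      # same raw score, but fails easy and passes hard items

index_consistent = mean_log_prob(consistent, theta, difficulties)
index_erratic = mean_log_prob(erratic, theta, difficulties)
```

The erratic pattern receives a much lower index than the consistent one even though both have the same raw score, which is the kind of signal that would prompt the score to be disregarded. The same calculation taken down a column of the data matrix, rather than along a row, gives an index for an item.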

A recent extension of this approach involves trying to identify
within the vector of item responses from a particular individual those
portions which provide reliable measurement information, on which
items (or groups of items) the subject appears to have guessed at
random, and how the total vector of responses may be selectively
edited in order to provide a more reliable estimate of the subject's
level of achievement.

6. Extensions to the Basic Model 

Two types of adaptation and extension will be considered here.
The first centers around the notion of sequential testing, in which
evidence of the level of ability of the subject is accumulated in
Bayesian fashion during the test session and may be used to determine
which items are to be attempted at the next point of the sequence
and/or when to terminate the testing session. This approach relies
upon the existence of difficulty calibrations for a pool or bank of
test items. Most of the research that has been done so far has
employed computers to manage the testing session: to select items for
the subject to answer, to keep track of measurement quality, to
generate up-to-date estimates of the ability of the subject (together
with the appropriate standard errors) and to decide when the session
should be terminated. Wright and Stone (1979) point out that
individual people can do most of this for themselves if provided with
suitable guidelines and computational aids, and in many circumstances
making the learner responsible for evaluating his own learning is a
useful thing to do.
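
The logic of such a session can be sketched in a few lines. The sketch below is an assumption-laden illustration rather than a description of any existing program: it uses a standard normal prior on a coarse grid of ability values, picks the unused item whose difficulty is closest to the current estimate, and stops when the posterior standard deviation falls below a hypothetical target. The item bank and the simulated examinee are invented.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of success for ability theta on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

GRID = [x / 10.0 for x in range(-40, 41)]   # candidate ability values, -4.0 to 4.0

def update(posterior, b, correct):
    """Multiply the posterior by the likelihood of one response and renormalize."""
    new = []
    for w, theta in zip(posterior, GRID):
        p = rasch_p(theta, b)
        new.append(w * (p if correct else 1.0 - p))
    total = sum(new)
    return [w / total for w in new]

def mean_sd(posterior):
    """Posterior mean (ability estimate) and standard deviation (its standard error)."""
    m = sum(w * t for w, t in zip(posterior, GRID))
    var = sum(w * (t - m) ** 2 for w, t in zip(posterior, GRID))
    return m, math.sqrt(var)

def run_session(bank, answer, sd_target=0.6):
    """Administer items one at a time; answer(b) returns True or False."""
    prior = [math.exp(-0.5 * t * t) for t in GRID]   # standard normal prior (assumption)
    total = sum(prior)
    posterior = [w / total for w in prior]
    remaining = list(bank)
    administered = []
    while remaining:
        m, sd = mean_sd(posterior)
        if sd < sd_target:                            # precise enough: stop testing
            break
        b = min(remaining, key=lambda d: abs(d - m))  # item nearest the current estimate
        remaining.remove(b)
        posterior = update(posterior, b, answer(b))
        administered.append(b)
    return mean_sd(posterior), administered

bank = [x / 4.0 for x in range(-12, 13)]   # hypothetical calibrated bank, -3.0 to 3.0
# Simulated examinee who succeeds on every item easier than 1.0 logits.
(estimate, sd), used = run_session(bank, answer=lambda b: b < 1.0)
```

Choosing the item nearest the current estimate is only one possible selection rule; it is used here because a Rasch item is most informative when its difficulty matches the ability being measured.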

The second area of development from the basic Rasch model is in
the extension from simple dichotomous scoring of items (right-wrong)
to a more complex system. Two separate situations need to be
considered. The first is when an item is not answered completely but
enough is done to earn some partial credit. Data would then consist
of scores in the range 0 to 1 for each item. The other case is that
which typically occurs with rating scales or attitude measures when
the respondent is asked to choose one from among a finite number of
discrete categories, and each category contains information about the
standing of the respondent on some latent trait. Douglas (1982) has
considered the theoretical implications of generalizing the basic
Rasch model to include both these cases, and it turns out that almost
everything that can be done for dichotomous items can also be done for
these more complex methods of scoring. For the rating scale problem
both Andrich (1977) and Wright and Masters (1982) have found it
convenient to concentrate on establishing the location of thresholds
(the point at which the probability of responding in one category
passes the probability of responding in the next one; see Figure 4).
Wright and Masters have produced some interesting theorems about the
importance of these thresholds being properly ordered, and about the
spacing of thresholds that maximizes the information gained. There
have been few practical applications of this approach to date.




[Figure 4: The probability of responding in various categories (horizontal axis: latent trait).]
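
The role of the thresholds can be made concrete with a small sketch. The text above does not give a formula, so the sketch assumes the adjacent-category logistic form commonly associated with this family of models, in which the log-odds of category k over category k-1 equals the ability minus the k-th threshold. The function name and threshold values are hypothetical.

```python
import math

def category_probs(theta, thresholds):
    """Probabilities of categories 0..m given thresholds delta_1..delta_m (assumed form)."""
    logits = [0.0]
    for delta in thresholds:
        # Each step up a category adds (theta - delta_k) to the log-weight.
        logits.append(logits[-1] + (theta - delta))
    weights = [math.exp(v) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

thresholds = [-1.0, 0.0, 1.5]                    # hypothetical, properly ordered
probs_at_second = category_probs(0.0, thresholds)
```

At an ability equal to the second threshold (0.0 here), categories 1 and 2 are equally probable, which is the crossing point described in the text; below it category 1 is the more likely of the two, above it category 2.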



For the analysis of "partial credit" data two computer programs
(CREDIT by Masters and POLYPAIR by Choppin) have been devised and
applied to real data sets. The latter program, for example, was used
in the assessment of writing skills which forms part of the British
National Assessment Program.
7. Points of Controversy

In some ways the Rasch model represents a revolutionary approach
to educational measurement that discards many time-honored constructs
in testing theory (e.g., true score, measurement error, and
reliability). On the other hand, it can be viewed as providing a
comprehensive and sound mathematical underpinning for the conventional
practice of using raw scores, and shows that in most testing
applications raw scores are all that are required. From this point of
view the Rasch model may be seen as less radical than other latent
trait models. Perhaps because the former view of the model was the
first to catch the imagination in the United States and has dominated
efforts to popularize it, it has been a subject of continuing
controversy. The most strident arguments are not concerned with how
best to use the Rasch model, but whether or not its use is ever
appropriate.

To some extent the Rasch model has been central in the general
attack on latent trait theory as applied to the measurement of student
achievement. Goldstein (1979), who has led this attack on the other
side of the Atlantic, stresses the fundamental difference between what
he regards as well-ordered traits such as aptitude and intelligence on
the one hand, and the complex pattern of behaviors that we call
educational achievement on the other. In his view it makes no sense
to apply any unidimensional model to the assessment of achievement.

Less extreme in their implications are the arguments within the
latent trait camp about whether the Rasch (i.e., one-parameter) model
is adequate for achievement testing, or whether a more complex
(usually three-parameter) model is indicated.

It is important to differentiate two kinds of usage. One is in
test construction, where in general the users of Rasch models appear to
be on firm ground in claiming that a strategy to develop and select
items that conform to the Rasch model will produce better test
instruments than would other more conventional strategies. The other
type of usage is concerned with the analysis of existing test data
(for example, the massive data sets of NAEP or the accumulated files
of SAT material at ETS) where items are likely to be so varied (and in
many cases so poor) that it is comparatively easy to show that the
Rasch model is not appropriate. Devotees of the Rasch model react to
this by dropping the non-fitting items (which may well be the
majority) and working with those that are left, but this cavalier
approach does not commend itself to many researchers. If one is
interested in analyzing and scaling data sets which include some
possibly very bad items, then something like the three-parameter model
is going to be needed.

This difference of emphasis among the areas of application has
its origins in contrasting views of measurement philosophy. As the
next paper in this collection makes clear, the Rasch model can be
regarded as a special case of the three-parameter model when the
discrimination parameters are held equal, and the "guessing" parameter
is fixed at zero. Mathematically, this view is undoubtedly
correct, but philosophically it is very misleading. Rasch developed
his model, in ignorance of Lord's seminal work on item characteristic
curves, on the basis of a set of features which were necessary for an
objective measurement system. For measurements with the required
properties he found that his model, or a simple mathematical
transformation of it, was the mathematically unique solution. The
three-parameter model that forms the basis of Lord's Item Response
Theory is not, and cannot be, "specifically objective". Those whose
main interest is in understanding existing data sets, and therefore in
careful modeling of observed ICCs, see little benefit or relevance in
specific objectivity. Those who wish to construct instruments to
measure individuals optimally tend to prefer the approach which offers
the stronger and more useful system. ICCs which reflect the behavior
of inefficient or ineffective items have little interest for them. As
has been suggested earlier in this paper, the Rasch model supports a
range of applications which goes well beyond what a latent trait model
that is not specifically objective can manage.

In the view of this writer, much of the energy which has fueled
professional arguments over which is the better model (and the many
research studies whose main goal was to compare the effectiveness of
the two models in exploring a particular set of data) stems from a
failure to appreciate that the two models are basically very
different, and were developed to answer different questions. Neither
is ever "true". Both are merely models, and it seems clear that in
some applications one is of more use than the other and vice versa.

Among users of the Rasch model there is little that is currently
controversial, due in no small part to the dominance of two computer
programs now in use around the world (BICAL, developed by Wright and
his associates in Chicago, and PAIR, developed by Choppin in London).
One current issue that requires clarification concerns the status of
"tests of fit". It is generally conceded by Rasch users that whereas
better tests of fit are available for the Rasch model than for most
other psychometric models, they still leave a lot to be desired. In
most cases, showing that an item does not fit the model merely
requires collecting a sufficiently large body of data. The area of
disagreement lies between those who prefer to treat fit/misfit as a
dichotomous categorization and draw up decision rules for dealing with
test data on this basis, and those who prefer to regard degree of
misfit as a continuous variable which needs to be considered in the
context of the whole situation. The present writer belongs in the
latter camp, but is prepared to admit that many of the "rules of
thumb" that have been developed lack much theoretical or empirical
basis.
THE THREE-PARAMETER LOGISTIC MODELS

Ronald K. Hambleton
University of Massachusetts, Amherst 



1. Definition and Background 

In a few words, item response theory postulates that (a) examinee
test performance can be predicted (or explained) by a set of factors
called traits, latent traits, or abilities, and (b) the relationship
between examinee item performance and the set of traits assumed to be
influencing item performance can be described by a monotonically
increasing function called an item characteristic function. This
function specifies that examinees with higher scores on the traits
have higher expected probabilities for answering the item correctly
than examinees with lower scores on the traits. In practice, it is
common for users of item response theory to assume that there is one
dominant factor or ability which explains performance. In the
one-trait or one-dimensional model, the item characteristic function
is called an item characteristic curve (ICC) and it provides the
probability of examinees answering an item correctly for examinees at
different points on the ability scale. In addition, it is common to
assume that item characteristic curves are described by one, two, or
three parameters. The interpretation of these parameters will be
described in section 3. In any successful application of item
response theory, parameter estimates are obtained to describe the test
items, ability estimates are obtained to describe the performance of
the examinees, and there is evidence that the chosen item response
model, at least to an adequate degree, fits the test data set
(Hambleton, Murray, & Simon, 1982).

Item response theory (or latent trait theory, or item
characteristic curve theory as it is sometimes called) has become a
very popular topic for research in the measurement field. There have
been numerous published research studies, conference presentations,
and diverse applications of the theory in the last several years (see,
for example, Hambleton et al., 1978; Lord, 1980; Weiss, 1980).
Interest in item response models stems from two desirable features
which are obtained when an item response model fits a test data set:
descriptors of test items (item statistics) are not dependent upon the
choice of examinees from the population of examinees for whom the test
items are intended, and expected examinee ability scores do not
depend upon the particular choice of items from the total pool of test
items to which the item response model has been applied. Invariant
item and examinee ability parameters, as they are called, are of
immense value to measurement specialists.

Today, item response theory is being used by many of the large
test publishers, state departments of education, and industrial and
professional organizations, to construct both norm-referenced and
criterion-referenced tests, to investigate item bias, to equate tests,
and to report test score information. In fact, the various
applications have been so successful that discussions of item response
theory have shifted from a consideration of their advantages and
disadvantages in relation to classical test models to consideration of
such matters as model selection, parameter estimation, and the
determination of model-data fit. Nevertheless, it would be misleading
to convey the impression that issues and technology associated with
item response theory are fully developed and without controversy.
Still, considerable progress has been made since the seminal papers by
Frederic Lord (1952, 1953). It would seem that item response model
technology is more than adequate at this time to serve a variety of
uses (see, for example, Lord, 1980) and there are several computer
programs available to carry out item response model analyses (see
Hambleton & Cook, 1977).

The purposes of this paper are to address (1) the measurement
philosophy underlying item response theory, (2) the assumptions
underlying one of the more popular of the item response models, the
three-parameter logistic model, (3) the strengths and weaknesses of
the three-parameter model, and present gaps in our knowledge of the
model, (4) several promising three-parameter model applications, (5)
extensions and new applications of the model, and (6) several
controversies.

2. Measurement Philosophy 

There are many well-documented shortcomings of standard testing
and measurement technology.¹ For one, the values of such useful item
statistics as item difficulty and item discrimination depend on the
particular examinee samples in which they are obtained. The average
level of ability and the range of ability scores in an examinee group
influence the values of the item statistics, often substantially.
This means that the item statistics are only useful when constructing
tests for examinee populations which are very similar to the sample of
examinees in which the item statistics were obtained. Another
shortcoming of standard testing technology is that comparisons of
examinees on an ability measured by a set of test items comprising a
test are limited to situations where examinees are administered the
same (or parallel) test items. But a problem is that many
achievement and aptitude tests are (typically) suitable for
middle-ability students and so the tests do not provide very precise
estimates of ability for either high- or low-ability examinees.
Increased test score validity without any increase in test length can
be obtained if the test difficulty is matched to the approximate
ability level of each examinee. But when several forms of a test
which vary substantially in difficulty are used, the task of
comparing examinees becomes more complex because test scores alone
cannot be used. For example, two examinees who perform at a 50% level
on two tests which differ substantially in difficulty cannot be
considered equivalent in ability, but how different are they in
ability? And how can the ability levels of two examinees be compared
when they receive different scores on tests which vary in their
difficulty?

¹ "Standard testing and measurement technology" refers to commonly
used methods and techniques for test design and analysis.

Another shortcoming of standard testing technology is that it
provides no basis for determining what a particular examinee might do
when confronted with a test item. Such information is necessary, for
example, if a test designer desires to predict test score
characteristics in one or more populations of examinees or to design
tests with particular characteristics for certain populations of
examinees. In addition to the three shortcomings of standard testing
technology mentioned above, standard testing technology has failed to
provide satisfactory solutions to many testing problems: for example,
the design of tests, identification of biased items, and the equating
of test scores. For these and other reasons, psychometricians have
been investigating and developing more appropriate theories of mental
measurements.

Item response theory purports to overcome the shortcomings of
classical or standard measurement theory by providing an ability
scale on which examinee abilities are independent of the particular
choice of test items from the pool of test items over which the
ability scale is defined. Ability estimates obtained from different
item samples for an examinee will be the same except for measurement
errors. This feature is obtained by incorporating information about
the items (i.e., their statistics) into the ability estimation
process. Also, item parameters are defined on the same ability
scale. They are, in theory, independent of the particular choice of
examinee samples drawn from the examinee pool for whom the item pool
is intended, although errors in item parameter estimation will be group
dependent. More will be said about this point later. Again, item
parameter invariance across samples of examinees differing in ability
is achieved by incorporating information about examinee ability levels
into the item parameter estimation process. Finally, by deriving
standard errors associated with the ability estimates, another of the
criticisms of the classical test model can be overcome.

In summary, the goal of item response theory is to provide both
invariant item statistics and ability estimates. These features will
be obtained when there is a reasonable fit between the chosen model
and the data set. Through the estimation process, items and persons
are placed on an ability scale in such a way that there is as close a
relationship as possible between the expected examinee probabilities
for success on test items obtained from the estimated item and ability
parameters and the actual probabilities of performance for examinees
positioned at each ability level. Item parameter estimates and
examinee ability estimates are revised continually until the maximum
agreement possible is obtained between predictions based on the
ability and item parameter estimates and the actual test data.

The feature of item parameter invariance can be observed in
Figure 1. In the upper part of the figure are three item
characteristic curves (ICCs); in the lower part are two distributions
of ability. When the chosen model fits the data set the same ICCs are
obtained regardless of the distribution of ability in the sample of
examinees used to estimate the item parameters. Notice that an ICC
provides the probability of examinees at a given ability level
answering each item correctly but the probability value does not
depend on the number of examinees located at the ability level. The
number of examinees at each ability level is different in the two
distributions. But the probability value is the same for examinees
in each ability distribution or even in the combined distribution. Of
course suitable item parameter estimation will require a heterogeneous
distribution of examinees on the ability measured by the test.
It is possible that to some researchers the property of item
invariance may seem surprising and unlikely to be obtained in
practice, but it is a property which is obtained whenever we study,
for example, the linear relationship (as reflected in a regression
line) between two variables, X and Y. The hypothesis is made that a
straight line can be used to connect the average Y scores conditional
on the X scores. When the hypothesis of a linear relationship is
satisfied, the same linear regression line is expected regardless of
the distribution of X scores in the sample drawn. Of course proper
estimation of the line does require that a suitably heterogeneous
group of examinees be chosen. The same situation arises in estimating
the parameters for the item characteristic curves which are also
regression lines (albeit, non-linear).
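
The regression analogy can be checked with a small simulation. The sketch below is illustrative only: the true line, the noise level, the sample size, and the two X distributions are all invented, and the random seed is fixed simply so the example is repeatable.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def sample(rng, mean_x, n=5000):
    """Draw X from a normal distribution centered at mean_x; Y = 2 + 0.5 X + noise."""
    xs = [rng.gauss(mean_x, 1.0) for _ in range(n)]
    ys = [2.0 + 0.5 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys

rng = random.Random(0)
low_group = fit_line(*sample(rng, mean_x=-1.0))    # sample concentrated at low X
high_group = fit_line(*sample(rng, mean_x=1.0))    # sample concentrated at high X
```

Although the two samples have quite different distributions of X, the fitted intercepts and slopes agree up to sampling error, because the line describes Y conditional on X rather than the distribution of X itself. The same reasoning applies to an item characteristic curve, which describes the probability of success conditional on ability.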
3. Assumptions 

When fitting an item response model to a test data set,
assumptions concerning three aspects of the data set are commonly made
(Lord, 1980; Wright & Stone, 1979). These three assumptions will be
introduced next.

Dimensionality. It is commonly assumed that only one ability is
being measured by a set of items in a test. Of course, this
assumption cannot be strictly met because there are always many
cognitive, personality, and test-taking factors which impact on test
performance, at least to some extent. These factors might include
level of motivation, test anxiety, ability to work quickly, knowledge
of the correct use of answer sheets, and other cognitive skills in
addition to the dominant one measured by the set of test items. What
is required for this assumption to be met adequately by a set of test
data is a "dominant" component or factor which influences test
performance. This dominant component or factor is referred to as the
ability measured by the test. This is the ability on which examinees
are being measured. All other contributing factors to test
performance are defined as errors.

Figure 1. A diagram showing the independence of the
shape of item characteristic curves from the
underlying ability distribution.

Item response models in which a single ability is presumed
sufficient to explain or account for examinee performance are referred
to as unidimensional models. Those models in which it is assumed that
more than a single ability is necessary to account for examinee test
performance are referred to as multi-dimensional models. These latter
models are complex and, to date, not well developed.

Principle of local independence. There is an equivalent
assumption to the assumption of unidimensionality known as the
assumption of the principle of local independence* (Lord & Novick,
1968; Lord, 1980). In words, the assumption requires that the
probability of an examinee answering an item correctly (obtained from
a one-dimensional model) is not influenced by his/her performance on
other items in a test. When an examinee learns information from one
test item which helps him or her on other test items the assumption is
violated. What the assumption means, then, is that only the examinee's
ability and the characteristics of the test item related to the
dominant trait measured by the test influence performance.

* Actually the equivalence only holds when the principle of local
independence is defined in the one-dimensional case.

Suppose we let $U_j$ be the response of a randomly chosen examinee
on item $j$ ($j = 1, 2, \ldots, n$), with $U_j = 1$ if the examinee
answers the item correctly and $U_j = 0$ if the examinee answers the
item incorrectly. Suppose also we let the symbols $P_j$ and $Q_j$
($Q_j = 1 - P_j$) denote the probability of the examinee answering the
item correctly and incorrectly, respectively. The assumption of the
principle of local independence in mathematical terms can then be
stated in the following way:

\[
\begin{aligned}
\operatorname{Prob}(U_1 = u_1, U_2 = u_2, \ldots, U_n = u_n)
  &= P_1^{u_1} Q_1^{1-u_1} \, P_2^{u_2} Q_2^{1-u_2} \cdots P_n^{u_n} Q_n^{1-u_n} \\
  &= \prod_{j=1}^{n} P_j^{u_j} Q_j^{1-u_j}
\end{aligned}
\]

In words, the assumption of local independence in the
one-dimensional case requires that the probability of any response pattern
occurring for an examinee is given by the product of probabilities
associated with his/her successes and/or failures on the test items.
The probabilities are obtained from a one-dimensional model.
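
The product rule can be written out directly. In the sketch below the item probabilities are hypothetical values for a single examinee at one fixed ability level; in practice they would come from the chosen one-dimensional model.

```python
from itertools import product

def pattern_probability(responses, p_correct):
    """Probability of a 0/1 response pattern under local independence."""
    prob = 1.0
    for u, p in zip(responses, p_correct):
        prob *= p if u == 1 else 1.0 - p   # P_j^{u_j} * Q_j^{1 - u_j}
    return prob

p_correct = [0.9, 0.7, 0.4]   # hypothetical P_j for three items

# Pattern (1, 1, 0): 0.9 * 0.7 * (1 - 0.4)
prob_110 = pattern_probability([1, 1, 0], p_correct)

# The probabilities of all 2^3 possible patterns sum to one.
total = sum(pattern_probability(u, p_correct) for u in product([0, 1], repeat=3))
```

The same function, evaluated over a range of ability values with the item probabilities recomputed at each value, gives the likelihood function from which an ability estimate is obtained.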

Mathematical form of the item characteristic curves. An item
characteristic curve is a mathematical function that relates the
probability of success on an item to the ability measured by the set
of items contained in the test. There is no concept comparable to the
notion of an item characteristic curve in standard test technology. A
primary distinction among different item response models is in the
mathematical form of the corresponding item characteristic curves. It
is up to the user to choose one of the many mathematical forms for the
shape of the item characteristic curves. In doing so, an assumption
about the items is being made which can be verified later by how well
the chosen model "explains" the observed test results.

Each item characteristic curve for a particular item response
model is a member of a family of curves of the same general form. The
number of parameters required to describe the item characteristic
curves in the family will depend on the particular item response
model. With the three-parameter logistic model, statistics which
correspond approximately to the notions of item difficulty and
discrimination (used in standard testing technology), and the
probability of low-ability examinees answering an item correctly, are
used. The mathematical expression for the three-parameter logistic
curve is:

\[
P_g(\theta) = c_g + (1 - c_g)\,
  \frac{e^{D a_g (\theta - b_g)}}{1 + e^{D a_g (\theta - b_g)}},
  \qquad g = 1, 2, \ldots, n,
  \tag{1}
\]

where:

$P_g(\theta)$ = the probability that an examinee with ability level
$\theta$ answers item $g$ correctly,

and

$b_g$ = the item $g$ difficulty parameter,

$a_g$ = the item $g$ discrimination parameter,

$c_g$ = the lower asymptote of an ICC, representing the
probability of success on item $g$ for low-ability
examinees,

$D$ = 1.7 (a scaling factor),

$n$ = the number of items in the test.
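
Equation (1) translates directly into a few lines of code. The sketch below is a plain transcription of the three-parameter logistic curve; the function name and the parameter values used in the example are hypothetical.

```python
import math

D = 1.7   # scaling factor from equation (1)

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC: probability of a correct answer at ability theta."""
    z = D * a * (theta - b)
    # e^z / (1 + e^z) written in the equivalent form 1 / (1 + e^(-z))
    return c + (1.0 - c) / (1.0 + math.exp(-z))

# Hypothetical item: discrimination 1.2, difficulty 0.5, lower asymptote 0.2
p_at_difficulty = icc_3pl(0.5, a=1.2, b=0.5, c=0.2)   # halfway between c and 1
p_low_ability = icc_3pl(-10.0, a=1.2, b=0.5, c=0.2)   # approaches the lower asymptote c
p_high_ability = icc_3pl(10.0, a=1.2, b=0.5, c=0.2)   # approaches 1
```

When ability equals the item difficulty the probability is (1 + c)/2, so b marks the point halfway between the lower asymptote and 1 rather than the point of 50% success, and a governs how steeply the curve rises through that point.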
The parameter $c_g$ is the lower asymptote of the item
characteristic curve and represents the probability of examinees with
low ability correctly answering an item. The parameter $c_g$ is included
in the model to account for test response data at the low end of the
ability continuum, where, among other things, guessing is a factor in
test performance. It is now common to refer to the parameter $c_g$ as
the pseudo-chance level parameter in the model.

Typically, $c_g$ assumes values that are smaller than the value that
would result if examinees of low ability were to guess randomly on the
item. As Lord (1974) has noted, this phenomenon can probably be
attributed to the ingenuity of item writers in developing "attractive"
but incorrect choices. For this reason, $c_g$ is no longer called the
"guessing parameter". To obtain the two-parameter logistic model from
the three-parameter logistic model, it must be assumed that the
pseudo-chance level parameters have zero values. This assumption is
most plausible with free response items but it can often be
approximately met when a test is not too difficult for the examinees.
For example, this assumption may be met when competency tests are
administered to students following effective instruction. Perhaps the
most popular of the present item response models is the one-parameter
logistic model (commonly named the "Rasch model" after Georg
Rasch, the discoverer of the model). It can be obtained from the
three-parameter logistic model by assuming that all items have
pseudo-chance level parameters equal to zero and by assuming all items
in the test are equally discriminating. Also, the one-parameter
model, or Rasch model as it is commonly referred to, can be produced
from a different set of measurement principles and assumptions.
Readers are referred to Choppin (in this volume) for an alternate
development of the Rasch model. The viability of these assumptions is
discussed by Hambleton et al. (1978).
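
The nesting of the three models can be shown in code. The sketch below is an illustration under one stated assumption: the common discrimination value is set to 1/D, which makes the one-parameter curve coincide with the usual logistic statement of the Rasch model. Any other common value would only rescale the ability metric. All function names are hypothetical.

```python
import math

D = 1.7   # scaling factor of the three-parameter logistic model

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def icc_2pl(theta, a, b):
    """Two-parameter model: pseudo-chance level fixed at zero."""
    return icc_3pl(theta, a, b, 0.0)

def icc_1pl(theta, b, a_common=1.0 / D):
    """One-parameter model: zero pseudo-chance level and a common discrimination."""
    return icc_2pl(theta, a_common, b)

def rasch(theta, b):
    """Rasch model in its usual logistic form."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

# Largest disagreement between the two forms over a grid of ability values.
abilities = [x / 2.0 for x in range(-6, 7)]
max_gap = max(abs(icc_1pl(t, 0.3) - rasch(t, 0.3)) for t in abilities)
```

The two curves agree to within rounding error, which is the mathematical sense in which the Rasch model is a special case of the three-parameter model. The code says nothing about the different measurement principles from which the Rasch model can also be derived.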

Item characteristic curves for the latent linear model¹ and the
one-, two-, and three-parameter logistic models are shown in Figure
2. Readers are referred to Hambleton (1979), Lord (1980), and Wright
and Stone (1979) for additional information about logistic test
models.

4. Strengths, Weaknesses, and Gaps

Item response models and their application to educational testing
and measurement problems have been under study for about fifteen years
now. Certainly there are many problems requiring resolution, but
enough is known about item response models to use them successfully in
solving many testing problems (see Lord, 1980; Hambleton, 1983). Item
response models, when they provide an accurate fit to a data set (and
in theory the three-parameter logistic model will fit a data set more
accurately than a logistic model with fewer item parameters), can
produce the invariant item and ability parameters described earlier.
Some promising applications will be described in the next two sections
(also see Hambleton, 1983).

On the negative side, the three-parameter model is based upon
several strong assumptions. (Of course, the one- and two-parameter
logistic models are based on even stronger assumptions.) When these
assumptions are not met, at least to an approximate degree, desirable
features expected from applying the three-parameter model will not be
obtained. Other weaknesses, presently, of the three-parameter model
are (1) the need for rather large numbers of items and examinees for
proper item parameter estimation, (2) the relatively high computer
costs for obtaining item and ability parameter estimates, and (3) the
difficulties inherent in interpreting a complex model to test
practitioners.

1 The item characteristic curves for the latent linear model are of
the form:

\[ P_g(\theta) = b_g + a_g \theta \]

[Figure 2. Examples of item characteristic curves: (a) latent linear curves; (b) one-parameter logistic curves; (c) two-parameter logistic curves; (d) three-parameter logistic curves.]

On the first point, Lord (1980) suggested that examinee sample sizes
in excess of 2,000 are needed. Perhaps Lord is overly conservative in
his figure, but it does appear that sample sizes in excess of 600 or
700 are needed, along with a disproportionate number of examinees near
the lower end of the ability scale so that the c parameters can be
estimated properly. Because of the required minimum sample sizes,
small scale measurement problems (e.g., teacher-made tests) cannot
properly be addressed with the three-parameter model. With respect to
the second point, it is common to report high costs associated with
using LOGIST, although there is evidence that the LOGIST program will
run substantially faster and cheaper on some computers. Hutten (1981)
reported an average cost of $69 to run 25 data sets with 1,000
examinees and 40 test items on a CYBER 175 ($800/hour for CPU time).
Finally, the untrained test developer will have difficulty working
with three statistics per item, but as CTB/McGraw-Hill has shown in
building the latest version of the California Tests of Basic Skills,
test editors can be trained to use the additional information
provided by the three-parameter model successfully (Yen, 1983).

There is (at least) one practical shortcoming of the three-parameter
model and its applications: there does seem to be a shortage of
available computer programs to carry out a three-parameter logistic
model analysis. The most readily available program is LOGIST,
described by Wingersky (1983) and Wingersky, Barton, and Lord (1982).
The most readily available version of this program runs on IBM
equipment, although there is evidence that the program may run
substantially faster on other computers. Additional investigation of
this finding is needed, along with ongoing studies to try to speed up
the convergence of estimates. In addition, there may be other ways to
improve the estimation process. Swaminathan and Gifford (1981) have
obtained very promising results with Bayesian item and ability
parameter estimates. Their results compare favorably with results
from LOGIST, and they can be obtained considerably faster and more
cheaply than the same estimates obtained with LOGIST.

There are (at least) three areas in which we lack full
understanding of item response models. First, additional robustness
studies with the one- and two-parameter logistic models are needed,
particularly with respect to a number of promising applications. What
is the practical utility of the three-parameter model in comparison to
the one- and two-parameter models? Second, appropriate methods for
testing model assumptions and determining the goodness of fit between
a model and a data set are needed. Hambleton and his colleagues
(Hambleton, 1980; Hambleton, Murray, & Simon, 1982) have made a
promising start by organizing many of the present methods and
developing several new ones. Much of their work involves the use of
graphs, replications, residual analyses, and cross-validation
procedures. More work along the same general lines would seem
desirable. Third, there is a great need for persons to gain
experience with the three-parameter model and to share their
newfound knowledge and experience with others.

5. Applications

In this section, several promising applications of the
three-parameter logistic model will be described briefly: item
banking, test development, criterion-referenced testing, item bias,
and adaptive testing. Other applications of the three-parameter model
are discussed by Hambleton et al. (1978), Lord (1980), and Hambleton
(1983).

Item banking. The development of criterion-referenced testing
technology has resulted in increased interest in item banking
(Choppin, 1976). An item bank is a collection of test items, "stored"
with known item characteristics. Depending on the intended purpose of
the test, items with desired characteristics can be drawn from the
bank and used to construct a test with known properties. Although
classical item statistics (item difficulty and discrimination) have
been employed for this purpose, they are of limited value for
describing the items in a bank because these statistics are dependent
on the particular group used in the item calibration process. Latent
trait item parameters, however, do not have this limitation, and
consequently are of much greater use in describing test items in an
item bank (Choppin, 1976). The invariance property of the latent
trait item parameters makes it possible to obtain item statistics that
are comparable across dissimilar groups. Since the item parameters
depend on the ability scale, it is not possible to directly compare
latent trait item parameters derived from different groups of
examinees until the ability scales are equated in some way.
Fortunately, the problem is not too hard to resolve, since Lord and
Novick (1968) have shown that the item parameters in the two groups
are linearly related. Thus, if a subset of calibrated items is
administered to both groups, the linear relationship between the
estimates of the item parameters can be obtained by forming two
separate bivariate plots, one establishing the relationship between
the estimates of the item discrimination parameters for the two
groups, and the second, the relationship between the estimates of the
item difficulty parameters. Having established the linear relationship
between item parameters common to the two groups, a prediction
equation can then be used to predict item parameters for those items
not administered to the first group. In this way, all item parameters
can be equated to a common group of examinees and corresponding
ability scale. One large test publishing company, the California Test
Bureau/McGraw-Hill, presently customizes tests for school districts
with items calibrated using the three-parameter logistic model.

Footnote to Section 5: Some of the material in this section is taken
from and/or edited from a paper by Hambleton et al. (1978).
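
The common-item step described above can be sketched in a few lines.
The sketch fits an ordinary least-squares line to the difficulty
estimates of the items taken by both groups and then uses that line as
the prediction equation; the same routine would be applied separately
to the discrimination estimates. Least squares is an assumption made
here for simplicity (the text specifies only a linear relationship),
and the numerical values are invented for illustration.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(values, slope, intercept):
    """Apply the prediction equation to items seen by only one group."""
    return [slope * v + intercept for v in values]

# Hypothetical difficulty estimates for four items given to both groups.
b_common_group2 = [-1.0, 0.0, 1.0, 2.0]
b_common_group1 = [-0.7, 0.5, 1.7, 2.9]

slope, intercept = fit_line(b_common_group2, b_common_group1)

# Items administered to the second group only, placed on the first
# group's scale by means of the prediction equation.
b_group2_only = [-0.5, 1.5]
b_on_group1_scale = predict(b_group2_only, slope, intercept)
```

With real data the points will scatter about the line, so the quality
of the equating depends on the number and the spread of the common
items.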

Test development. The three-parameter model is presently being
used by a number of organizations in test development (e.g.,
CTB/McGraw-Hill, ETS). The three-parameter model provides the test
developer not only with sample invariant item parameters but also with
a powerful method of item selection (Birnbaum, 1968). This method
involves the use of information curves, i.e., items are selected
depending upon the amount of information they contribute to the total
amount of information supplied by the test (Lord, 1980). One of the
useful features of item information curves is that the contribution of
each item to the test information function can be determined without
knowledge of the other items in the test. When standard testing
technology is applied, the situation is very different. The
contribution of any item to such statistics as test reliability cannot
be determined independently of the characteristics of all the other
items in the test.

Lord (1977) outlined a procedure for use of item information
curves to build a test to meet any desired set of specifications. The
procedure employs a pool of calibrated items, with accompanying
information curves, such as might be obtained from the item banking
methods described earlier. The procedure outlined by Lord consists of
the following steps:

     1.  Decide on the shape of the desired test information curve.
         Lord (1977) calls this the target information curve.

     2.  Select items with item information curves that will fill up
         the hard-to-fill areas under the target information curve.

     3.  After each item is added to the test, calculate the test
         information curve for the selected test items.

     4.  Continue selecting test items until the test information
         curve approximates the target information curve to a
         satisfactory degree.

An example of the application of this technique to the development of
tests for differing ranges of ability (based on simulated data) is
given by Hambleton (1979).
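
One possible reading of these steps is sketched below. The item
information function is the standard one for the three-parameter
logistic model, I(theta) = (Da)^2 (Q/P) [(P - c)/(1 - c)]^2, and test
information is the sum of the item information functions. The greedy
selection rule (pick the item that fills the largest part of the
remaining shortfall) is an assumption of this sketch and is not
claimed to be Lord's own rule; `build_test` and its arguments are
hypothetical names.

```python
import math

D = 1.7  # conventional scaling constant (an assumption of this sketch)

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c):
    """Item information for the three-parameter logistic model."""
    p = icc_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def build_test(pool, grid, target, max_items):
    """Greedy sketch of steps 2 to 4; pool holds (a, b, c) tuples."""
    selected = []
    test_info = [0.0] * len(grid)
    remaining = list(range(len(pool)))
    while remaining and len(selected) < max_items:
        # Area still missing under the target information curve.
        shortfall = [max(t - i, 0.0) for t, i in zip(target, test_info)]
        if max(shortfall) <= 0.0:
            break  # step 4: the target has been reached on the grid

        def filled(idx):
            a, b, c = pool[idx]
            return sum(min(item_information(th, a, b, c), s)
                       for th, s in zip(grid, shortfall))

        best = max(remaining, key=filled)      # step 2
        remaining.remove(best)
        selected.append(best)
        a, b, c = pool[best]
        test_info = [i + item_information(th, a, b, c)   # step 3
                     for th, i in zip(grid, test_info)]
    return selected, test_info
```

Because item information functions are additive, the test information
curve in step 3 can be updated one item at a time without reference to
the items already chosen.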
Criterion-referenced testing. A principal use of a
criterion-referenced test is to estimate an examinee's level of
mastery (or "ability") on an objective. Thus, a straightforward
application of the three-parameter model would produce examinee
ability scores. Among the advantages of this application would be that
items could be sampled (for example, at random) from an item pool for
each examinee, and all examinee ability estimates would be on a common
scale. A potential problem with this application, however, concerns
the estimation of ability with relatively short tests.

Since item parameters are invariant across groups of examinees,
it would be possible to construct criterion-referenced tests to
"discriminate" at different levels of the ability continuum. Thus, a
test developer might select an "easier" set of test items for a
pretest than for a posttest, and still be able to measure "examinee
growth" by estimating examinee ability with the three-parameter model
at each test occasion on the same ability scale. This cannot be done
with classical approaches to test development and test score
interpretation. If we had a good idea of the likely range of ability
scores for the examinees, test items could be selected so as to
maximize the test information in the region of ability for the
examinees being tested. The optimum selection of test items would
contribute substantially to the precision with which ability scores
were estimated. In the case of criterion-referenced tests, it is
common to observe substantially lower test performance on a pretest
than on a posttest; therefore, the test constructor could select the
easier test items from the domain of items measuring an objective for
the pretest, and more difficult items could be selected for the
posttest. This would enable the test constructor to maximize the
precision of measurement of each test in the region of ability where
the examinees would most likely be located. Of course, if the
assumption about the location of ability scores was not accurate,
gains in precision of measurement would not be obtained.

The results reported in Tables 1 and 2 (from Hambleton, 1979) 
show clearly the advantages of "tailoring" a test to the ability level 
of a group. Of course, the potential improvements depend on the 
validity of a test developer's assumption about the examinee ability 
distribution. If he or she uses an incorrect prior distribution as a 
basis for designing a test, the resulting test will certainly not have 
the desired characteristics. 
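
The efficiency and effective-length columns of Table 1 can be
reproduced from the information columns. The sketch below takes
relative efficiency to be the ratio of two test information values at
the same ability level, and reads the change in effective test length
as that ratio minus one, expressed as a percentage. The second reading
is inferred from the tabled values rather than stated in the text, and
the function names are illustrative.

```python
def relative_efficiency(info_form, info_reference):
    """Information of a form relative to the reference form."""
    return info_form / info_reference

def change_in_effective_length(efficiency):
    """Percentage change in effective test length implied by an efficiency."""
    return round((efficiency - 1.0) * 100)

# Table 1, ability level -3.0: wide range .22, easy .36, difficult .07.
easy_eff = relative_efficiency(0.36, 0.22)
difficult_eff = relative_efficiency(0.07, 0.22)
```

The ratios computed from the rounded entries can differ from the
tabled efficiencies in the second decimal place, presumably because
the table was computed from unrounded information values.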

                                Table 1

Test Information Curves and Efficiency for Three Criterion-Referenced
Test Designs From a Domain of Items of Equal Discrimination
and Pseudo-chance Levels Equal to .20

                                      Efficiency (Relative   Change in Effective
          Test Information Curves     to "Wide Range Form")  Test Length
Ability   Wide Range  Easy  Difficult  Easy     Difficult     Easy     Difficult
Level     Form        Form  Form       Form     Form          Form     Form

 -3.0       .22        .36    .07      1.63       .31          63%      -69%
 -2.0       .86       1.31    .36      1.53       .42          53%      -58%
 -1.0      2.08       2.81   1.31      1.35       .63          35%      -37%
  0.0      3.04       3.29   2.81      1.08       .92           8%       -8%
  1.0      2.76       2.28   3.29       .82      1.19         -18%       19%
  2.0      1.69       1.12   2.28       .66      1.35         -34%       35%
  3.0       .79        .46   1.12       .59      1.42         -41%       42%


                                Table 2

Test Information Curves and Efficiency for Three Criterion-Referenced
Test Designs From a Domain of Items with Varying Discrimination
Indices and Pseudo-chance Levels Equal to .20

                                      Efficiency (Relative
          Test Information Curves     to "Wide Range Form")
Ability   Wide Range  Easy  Difficult  Easy     Difficult
Level     Form        Form  Form       Form     Form

[Table entries are not legible in the source.]


Item bias. Identifying biased items in a test usually involves
comparing the performance of the subgroups of interest (e.g., Blacks,
Hispanics, and Whites) on the test items. The problem that arises is
that differences among the subgroups due to bias are confounded with
any true differences in abilities among the subgroups. What is needed
is an item bias detection method that can control for true ability
differences. Via a three-parameter model analysis, it is possible to
compare corresponding item characteristic curves (ICCs). At each
ability level, independent of the proportion of examinees in each
subgroup who are located at the ability level, the expected proportion
of successes in each subgroup, obtained from the ICCs, can be
compared. The ICCs estimated in each group, in theory, do not depend
upon the underlying ability distributions. Any differences in the
curves, beyond the usual sampling errors, can be attributed to
differential subgroup responses to the items, i.e., bias. It is
becoming routine practice for several large test publishers to
investigate bias in test items with the aid of the three-parameter
logistic model. Since the three-parameter model often provides a
somewhat better fit to test data at the lower end of the ability
continuum (Hambleton et al., 1982) than less general logistic models,
the three-parameter model may be more useful than other logistic
models for studying bias.
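
The comparison of corresponding ICCs can be sketched as follows. The
sketch assumes that the two sets of item parameter estimates have
already been placed on a common ability scale, and it reports only the
size of the gap between the curves; deciding whether a gap exceeds the
usual sampling error is a separate step that is not shown. The
parameter values and function names are invented for illustration.

```python
import math

D = 1.7  # conventional scaling constant (an assumption of this sketch)

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def icc_differences(params_group1, params_group2, grid):
    """Difference in expected proportion correct at each ability level."""
    return [icc_3pl(theta, *params_group1) - icc_3pl(theta, *params_group2)
            for theta in grid]

# Ability levels from -3.0 to 3.0 in steps of 0.5.
grid = [-3.0 + 0.5 * k for k in range(13)]

# Hypothetical (a, b, c) estimates for one item in two subgroups.
diffs = icc_differences((1.0, 0.0, 0.2), (1.0, 0.5, 0.2), grid)
largest_gap = max(abs(d) for d in diffs)
```

Because the comparison is made ability level by ability level, it does
not depend on how many examinees in each subgroup fall at each level.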

Adaptive testing. Possibly the first and most well-developed
application of the three-parameter logistic model to date is adaptive
testing (Lord, 1980; Weiss, 1980). In adaptive testing each examinee
is administered a set of test items "tailored" or "adapted" to his/her
ability level. Clearly, total test scores cannot provide an adequate
basis upon which to compare examinees. Some examinees will be
administered sets of test items which are substantially more difficult
(or easier) than the test items administered to other examinees. By
calibrating test items using the three-parameter logistic model in
advance of the actual testing, and using the three-parameter model to
estimate examinee ability levels, examinees can be compared even
though the test items administered to different examinees may differ
substantially in difficulty. Because of the ready availability of the
computer, scoring difficulties associated with the use of the
three-parameter model can be overcome easily.
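
A small sketch shows why ability estimates remain comparable when
examinees take different items. It treats the item parameters as known
from a prior calibration and finds the ability value that maximizes
the likelihood of the observed responses by a simple grid search. The
grid search, the item parameters, and the function names are
assumptions of this sketch; operational programs use more refined
estimation procedures.

```python
import math

D = 1.7  # conventional scaling constant (an assumption of this sketch)

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of right (1) / wrong (0) responses at theta."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = icc_3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum likelihood estimate of ability."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda th: log_likelihood(th, items, responses))

# Two examinees with the same number right on different item sets.
easy_items = [(1.0, -1.0, 0.2), (1.0, -0.5, 0.2), (1.0, 0.0, 0.2)]
hard_items = [(1.0, 1.0, 0.2), (1.0, 1.5, 0.2), (1.0, 2.0, 0.2)]
theta_easy = estimate_ability(easy_items, [1, 1, 0])
theta_hard = estimate_ability(hard_items, [1, 1, 0])
```

Both examinees answer two of three items correctly, yet the examinee
who took the harder items receives the higher ability estimate, which
is the comparison that total scores cannot provide.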

The U.S. military is firmly committed to the use of adaptive
testing with the three-parameter model in many of its testing
programs. Presently a feasibility study is being conducted, along
with the preparation of plans for adaptive testing implementation and
evaluation of the total adaptive testing system.

6. Possible Extensions/New Applications

Numerous researchers are presently addressing the development of
new item response models. For example, Samejima (1979) is exploring
the development of multidimensional models in which item options are
ranked based on their relationship to ability, and characteristic
curves are produced for each option. McDonald (1982) has provided a
general formulation for generating a wide range of multidimensional
linear and non-linear polychotomous item response models. Bock,
Mislevy, and Woodson (1982) have described a two-parameter item
response model which can handle continuous data and where the unit of
analysis can be a group (e.g., the classroom or a school). This model
will be especially useful in program evaluation investigations. A
minor variation of the three-parameter model which appears to have
some utility is a model in which a common value of the c parameter is
used for all test items (Wingersky, 1983). This revised
three-parameter model will receive some use in the coming years. A
four-parameter logistic model has also been suggested (the fourth
parameter is the upper asymptote), but it appears to have very limited
practical usefulness. All of these new models can be viewed as
modifications/extensions of the three-parameter logistic model, and
they will undoubtedly receive study from researchers in the coming
years.

Because of the newness of the IRT area, all applications of the
three-parameter model might legitimately be classified as new. For
the purposes of this paper, "new applications" will be those which to
date have not been published. Two new applications, then, of the
three-parameter model, to the problems of item selection (Hambleton &
de Gruijter, 1983) and score prediction (Hambleton & Martois, 1983),
will be described briefly next.

Item selection. Item response models appear useful for the
problem of item selection because they lead to item statistics which
are referenced to the same scale on which examinee abilities are
defined. In addition, it should be noted that IRT provides a
procedure for placing a cut-off score, which is normally set on a
proportion-correct scale defined over a domain of items, on the same
scale as the test items and the examinees (Lord, 1980). Therefore,
the usefulness of a test item for measurement at any point on the
ability scale can be assessed.
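
One way to carry a cut-off score from the proportion-correct scale to
the ability scale is to find the ability at which the expected
proportion correct over the domain equals the cut-off. The sketch
below does this by bisection, which works because the expected
proportion correct increases with ability. It is an illustration under
stated assumptions rather than the procedure of any cited author: the
domain is represented by hypothetical (a, b, c) tuples, and the
cut-off must lie above the average pseudo-chance level and below 1.

```python
import math

D = 1.7  # conventional scaling constant (an assumption of this sketch)

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def domain_score(theta, items):
    """Expected proportion correct over the domain at ability theta."""
    return sum(icc_3pl(theta, a, b, c) for a, b, c in items) / len(items)

def cutoff_on_ability_scale(pi0, items, lo=-6.0, hi=6.0, tol=1e-6):
    """Ability at which the expected proportion correct equals pi0."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if domain_score(mid, items) < pi0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical domain of three calibrated items and a cut-off of .80.
domain = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.2)]
theta_cut = cutoff_on_ability_scale(0.80, domain)
```

Once the cut-off is expressed on the ability scale, the information
each item supplies at that point can be used to judge its usefulness
for mastery decisions.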

Hambleton and de Gruijter (1983) described a nine-step procedure
for selecting test items using three-parameter model item statistics,
and via a computer simulation study showed the advantages, at least in
the absence of errors associated with item parameter estimates, of
item selection with the aid of IRT over a standard item selection
procedure.

Test score predictions. The concept of item banking has
attracted considerable interest in recent years from school districts,
state departments of education, and test publishing companies. When
item banks consist of test items which are technically sound and
validly measure the objectives or competencies to which they are
referenced, the task of producing high quality tests is made
considerably easier. Item banks are most often used to construct
criterion-referenced tests (CRTs), or mastery tests or competency
tests, as they are sometimes called. What is not commonly available
for use with these CRTs are derived scores such as percentiles.
Derived scores are not always valued, but on occasion they are
required by school districts who receive federal funds (e.g., Title
I), for they must evaluate their funded programs with national norms
(e.g., percentile scores).

In theory, the problem faced by school districts who require
(1) information for diagnosing and monitoring student performance in
relation to competencies and (2) normative scores for the comparison
of examinees is easy to solve. Teachers can use their item banks to
build classroom tests on an "as-needed" basis, and when the need
arises, they can administer any necessary commercially available
standardized norm-referenced tests. But this solution has problems:
(1) the amount of testing time for students is increased, and (2) the
financial costs of school testing programs are increased. On the other
hand, when testing time is held constant and norm-referenced tests
are administered, there is less time available for instructionally
relevant testing (i.e., CRTs). A more satisfactory solution would
allow teachers to administer test items measuring objectives of
interest in their instructional programs and, at the same time, allow
for normative scores to be estimated from the test items which are
administered. An often used solution, selecting a norm-referenced
test to provide both normative scores and criterion-referenced
information through the interpretation of examinee performance on an
item by item basis, does not provide very suitable
criterion-referenced measurement and will not ensure that all
competencies of interest are measured in the test.

Hambleton (1980) suggested a possible item response model
solution to the problem of providing both instructional information
and normative information from a single test. A latent ability scale
to which a large pool of test items is referenced can be very useful
in obtaining normative scores from tests constructed by drawing items
from the pool. A norms table can be prepared from the administration
of a sample of items in the pool. Then the norms table can be used
successfully with any tests which are constructed by drawing items
from the pool. Local norms can be prepared by districts who build
their own item banks. A test publishing company probably would
prepare national norms for selected tests constructed from their item
banks.

Hambleton and Martois (1983) recently finished a study in which
it was found that both the one- and the three-parameter logistic
models resulted in excellent predictions of how examinees performed on
a norm-referenced test. Predictions were made from tests with items
that were easier than, comparable to, or harder than items in the
normed test. Similar results were obtained in three subject areas at
two grade levels. Further research along the same general lines seems
highly desirable because of the importance of the problem area.

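
A minimal sketch of how such a prediction might be formed is given
below. It assumes that the items of the normed test have been
calibrated on the same ability scale as the item bank, takes an
ability estimate obtained from a locally built test as given, and
predicts the number-correct score on the normed test as the sum of the
item characteristic curves at that ability. This is one plausible
procedure and is not necessarily the one used by Hambleton and
Martois; the parameter values are invented.

```python
import math

D = 1.7  # conventional scaling constant (an assumption of this sketch)

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def predicted_number_correct(theta, normed_items):
    """Expected number-correct score on the normed test at ability theta."""
    return sum(icc_3pl(theta, a, b, c) for a, b, c in normed_items)

# Hypothetical calibrated items of a short normed test.
normed_test = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
score_low = predicted_number_correct(-1.0, normed_test)
score_high = predicted_number_correct(1.0, normed_test)
```

The predicted score could then be referred to the norms table of the
normed test to obtain a percentile.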
7. Controversies 

Perhaps like any emerging area, item response theory has
generated considerable controversy and strong emotional feelings in
support of one model versus another. Much of the debate has centered
on the choice between the one- and three-parameter logistic models.
There has also been some controversy surrounding the utility of
Bayesian estimators (Samejima versus Novick and Swaminathan) and the
appropriateness of item response models for the analysis of aptitude
versus achievement tests. On this latter point, there is some feeling
that items on achievement tests are instructionally sensitive and
therefore item response model statistics will not be invariant in pre-
and post-instructional groups.

With respect to the choice of the one- versus the three-parameter
logistic model, a number of questions have arisen:

     1.  What is the effect of boundary constraints placed on item
         and ability parameter estimates obtained with LOGIST?

     2.  What is the practical utility of the three-parameter model?
         In most practical settings, won't the two models produce
         highly similar results?

     3.  What is the additional cost of running a three-parameter
         model analysis, and is the practical utility of the gains
         that accrue worth the financial costs and the added
         complexity which results?

     4.  Since examinees can guess the answers to multiple-choice
         test items, the three-parameter model should be selected on
         the basis of this a priori consideration (Traub, 1983).

     5.  How well do the item response models fit any data sets?
         This point is in dispute because many of the present
         goodness of fit statistics have been found to be
         inappropriate (e.g., see papers by Wollenberg, 1980; Divgi,
         1981).

These and other questions will undoubtedly be addressed in the coming
years. Answers will contribute to our knowledge of the three-parameter
logistic model and the situations in which the model should be used.


MEASURING ACHIEVEMENT WITH LATENT STRUCTURE MODELS

Rand R. Wilcox
Center for the Study of Evaluation
University of California, Los Angeles

1. MEASUREMENT PHILOSOPHY

The basic assumption in latent class models designed to measure
achievement is that an examinee can be described as knowing or not
knowing the answer to a test item, and that inferences about an
examinee's ability level should take this notion into account. The
goals of an n-item test might be to determine how many of the items an
examinee knows, which items are known and which are not known, or what
proportion of items among a domain of items are known. The problem is
that examinees might give the correct response when they do not know,
or they might carelessly give the wrong response when they know.
Latent class models are an attempt to measure and correct the effects
of these errors when addressing a particular measurement problem. Even
if some other model is ultimately preferred, such as a latent trait
model, latent class models are potentially useful.

Currently it appears that correcting for guessing is more important
than might have been expected. Moreover, assuming random guessing seems
to be an unsatisfactory solution. Consider, for example, the problem of
determining the length of a criterion-referenced test where the goal is
to determine whether an examinee's percent correct true score or domain
score, p, is above or below some known constant p0. If p0 = .8 and
n = 29 items are used, the probability of correctly determining whether
p > p0 is at least .9 when p > .9 or p < .7, and when the binomial
error model is assumed. If random guessing is assumed, nearly 200 items
are needed (van den Brink and Koele, 1980), and if one allows for the
possibility that guessing is not at random, over 2,600 items are
required to attain the same level of accuracy (Wilcox, 1980). In some
cases guessing might be nearly random, but there is empirical evidence
that this is generally not the case (Coombs et al., 1956; Bliss, 1980;
Cross and Frary, 1977; Wilcox, 1982a, 1982b).
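
The figure of n = 29 under the binomial error model can be checked
with a short calculation. The sketch assumes a decision rule that the
text does not spell out: the examinee is judged to be above p0 when
the observed proportion correct exceeds p0. It covers only the
binomial error model and does not reproduce the guessing models that
lead to the larger test lengths cited above.

```python
import math

def prob_at_least(n, p, k):
    """P(X >= k) for a binomial variable with n trials and success rate p."""
    return sum(math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)
               for x in range(k, n + 1))

def prob_correct_decision(n, p, p0):
    """Probability of a correct decision when x/n > p0 means 'above p0'."""
    cut = math.floor(n * p0) + 1  # smallest number correct with x/n > p0
    above = prob_at_least(n, p, cut)
    return above if p > p0 else 1.0 - above

# Worst cases at the edges of the region p > .9 or p < .7.
at_upper_edge = prob_correct_decision(29, 0.9, 0.8)
at_lower_edge = prob_correct_decision(29, 0.7, 0.8)
```

With 29 items the cut falls at 24 correct, and the probability of a
correct decision is at least .9 at both edges, in agreement with the
statement above.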

Another way of describing the measurement philosophy of latent class
models is that an examinee's test score is a function, in part, of the
distractors that are used, and that it is important to take this effect
into account. In the past this problem was ignored, probably because
there were no reasonable ways of dealing with it, and because it was not
clear just how serious this problem was. Now, however, there are several
ways of measuring and correcting the effects of distractors. It might
appear that some latent trait models deal with guessing, but in fact
latent trait models ignore the errors that are of concern here. Thus,
these errors might have a serious effect on how latent trait models are
used and interpreted. Wainer and Wright (1980) as well as Mislevy and
Bock (1982) examined certain aspects of how guessing affects latent trait
models, but the type of guessing examined here is different.

2. THE MODELS AND THEIR ASSUMPTIONS

Generally, latent class models are based on assumptions about how
examinees behave when responding to an item, or how items are related
to one another, or the manner in which tests are administered. While
a general description of latent class models is possible, such a
description is not given here. Instead, attention is focused on those
models that seem to have the most practical value.

A Latent Structure Model for Answer-Until-Correct Tests

This section assumes that an examinee responds to a multiple-choice
test item according to an answer-until-correct (AUC) scoring procedure.
This means that if an examinee chooses an incorrect response, another
response is chosen, and this process continues until the correct response
is identified.

AUC tests are easily administered in the classroom using specially
designed answer sheets where the examinee erases a shield corresponding
to a particular alternative. (These answer sheets are available
commercially, for example, through Van Valkenburg, Nooger and Neville in
New York, N.Y., and they are relatively inexpensive.) If the letter under
the shield indicates an incorrect response, the examinee erases another
shield, and this continues until the correct shield is erased.

Consider a population of examinees, and let $\zeta_i$ be the proportion
of the examinees who can eliminate i distractors from consideration.
That is, because of partial information, some of the examinees will rule
out some of the distractors without knowing the correct response. If
there are t alternatives from which to choose, and if the examinee can
eliminate t-1 distractors from consideration, the examinee is said to
know the correct response. Thus, $\zeta_{t-1}$ is the probability that a
randomly sampled examinee knows the correct response. Note that no
distinction is made between examinees who can eliminate all t-1
distractors via partial information and those that know. In other words,
an examinee might choose the correct response, not because the correct
answer is known, but because the test constructor was unable to produce
at least one effective distractor. Thus, it is assumed that at least one
effective distractor is being used, and presumably this problem can be
minimized by choosing t to be reasonably large. Of course, the crucial
step is finding someone who can write effective distractors.

As alluded to earlier, it is assumed that among the examinees who
do not know, some might be able to eliminate one or more distractors from
consideration via partial information. It is further assumed that once
these distractors are eliminated, the examinee guesses at random among
the alternatives that remain. Hence, if $p_i$ is the probability of a
correct response on the $i$th attempt of the item ($i = 1, \ldots, t$),

$$p_i = \sum_{j=0}^{t-i} \frac{\zeta_j}{t-j}.$$

For example, if t = 3,

$$p_1 = \frac{\zeta_0}{3} + \frac{\zeta_1}{2} + \zeta_2,$$

$$p_2 = \frac{\zeta_0}{3} + \frac{\zeta_1}{2},$$

$$p_3 = \frac{\zeta_0}{3}.$$

In general, the proportion of examinees who know the correct response is

$$\zeta_{t-1} = p_1 - p_2. \qquad (2.1)$$

The model implies that

$$p_1 \geq p_2 \geq \cdots \geq p_t, \qquad (2.2)$$

and this can be tested by applying results in Robertson (1978). Empirical
investigations (Wilcox, 1982a, 1982b) suggest that (2.2) will usually hold.
The next section describes how one might proceed when (2.2) appears to
be unreasonable.

For N randomly sampled examinees, let $x_i$ be the number who get the
correct response on the $i$th attempt. Then the $x_i$'s have a multinomial
distribution given by

    $\binom{N}{x_1 \cdots x_t} p_1^{x_1} \cdots p_t^{x_t}$, where $\binom{N}{x_1 \cdots x_t} = N!/(x_1! \cdots x_t!)$,

$\sum x_i = N$, $0 \le p_i \le 1$, and $\sum p_i = 1$. An unbiased maximum
likelihood estimate of $p_i$ is just $x_i/N$, and so

    $\hat{\zeta}_{t-1} = (x_1 - x_2)/N$                           (2.3)

is a maximum likelihood estimate of the proportion of examinees
who know the correct response. Semantically, if we compute the
proportion of examinees who get the item correct on the first attempt,
and then subtract the proportion who get it right on the second attempt,
we have an estimate of the probability that the typical examinee will
know the answer.

Note that $\hat{\zeta}_{t-1}$ given by (2.3) can be negative, but
$\zeta_{t-1}$ cannot be negative when the model is assumed to be true.
This can be corrected by simply estimating $\zeta_{t-1}$ to be zero when
$\hat{\zeta}_{t-1} < 0$. From Barlow et al. (1972), a maximum likelihood
estimate of $\zeta_{t-1}$ under the assumption that (2.2) holds
can be had by applying the pool-adjacent-violators algorithm.
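
The estimation step just described can be sketched as follows. This is
an illustrative implementation, not the authors' code: it assumes the
order-restricted estimate is obtained by pooling adjacent sample
proportions with equal weights, and the function names and counts are
hypothetical.

```python
from fractions import Fraction


def pool_adjacent_violators(values):
    """Return the nonincreasing sequence closest (least squares, equal
    weights) to `values`, found by pooling adjacent violators."""
    blocks = []  # each block is [sum of pooled values, number pooled]
    for v in values:
        blocks.append([v, 1])
        # Pool while an earlier block mean is smaller than the next one.
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]):
            total, size = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += size
    fitted = []
    for total, size in blocks:
        fitted.extend([total / size] * size)
    return fitted


def estimate_proportion_who_know(counts):
    """counts[i-1] = number of examinees correct on attempt i."""
    n = sum(counts)
    p_hat = pool_adjacent_violators([Fraction(x, n) for x in counts])
    return p_hat[0] - p_hat[1], p_hat


# Hypothetical counts in which more examinees succeed on the second
# attempt than on the first, so (x_1 - x_2)/N would be negative.
zeta_hat, p_hat = estimate_proportion_who_know([30, 40, 30])
```

After pooling, the first two proportions are equal, so the estimate of
the proportion who know is zero rather than negative.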

A Misinformation Model 



The previous section assumed that the inequality in equation (2.2)
is true, but experience indicates that occasionally this will not be the
case. In this event a misinformation model may be appropriate. Of course
for some items an investigator might suspect a misinformation model is
needed before any test data are collected, in which case the results in
this section might be applied without testing (2.2).

As will soon become evident, there is no single misinformation
model, but rather a class of models that might be used. The choice
from among these models will depend on what seems to be a reasonable
assumption about how examinees behave. At the moment there are no
empirical procedures to aid a test constructor when choosing from among
the various misinformation models. So far, however, this does not seem
to be a serious problem.

To better understand how to apply these models, consider the
following test item.

    When a block of iron is heated until it is red hot, it
    gets bigger. If the iron weighs 20 lbs. at room
    temperature, how much will it weigh when red hot?

    1) 19.8 lbs.  2) 20 lbs.  3) 20.1 lbs.  4) 20.5 lbs.
    5) 20.61 lbs.

This item is similar to one investigated in Wilcox (1982b) where the
examinees were approximately 14 years old. The point is that it seems
reasonable to suspect that some examinees will choose from among the last
three alternatives because they believe the iron weighs more when it expands.
The goal then is to devise a model that takes this behavior into account.
In this section it is assumed that the examinees belong to one of
three mutually exclusive groups: 1) they know the item, 2) they have
misinformation, or 3) they do not know, do not have misinformation, and
guess at random. For examinees with misinformation, it is also assumed
that they will choose c specific incorrect alternatives before choosing
the correct response. At the moment there is no empirical method for
choosing c; this must be done based on what seems reasonable for the item
being used. For example, in the item described above, c=3 would be
considered. In some cases the resulting latent structure model can be
checked with a goodness-of-fit test, but as will be seen this is not
always the case.

For the population of examinees being tested, let $\zeta$ be the
proportion of examinees who know, $\nu_1$ be the proportion who do not
know, do not have misinformation and guess at random, and let $\nu_2$ be
the proportion who have misinformation. If an AUC scoring procedure is
used, and if $p_i$ is defined as before, then for c=3 and t=5

    $p_1 = \zeta + \nu_1/5$,                                      (2.4)

    $p_2 = \nu_1/5$,                                              (2.5)

    $p_3 = \nu_1/5$,                                              (2.6)

    $p_4 = \nu_2 + \nu_1/5$,                                      (2.7)

    $p_5 = \nu_1/5$.                                              (2.8)

Thus, $\zeta = p_1 - p_2$ as before, and $\zeta$ is estimated with
$(x_1 - (x_2 + x_3 + x_5)/3)/N$. The model can be tested with the usual
chi-square test, and it gave a good fit to the data in Wilcox (1982b).
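
The estimate and the chi-square check can be sketched in a few lines.
This is a hedged illustration rather than the procedure used in Wilcox
(1982b): the function name and counts are hypothetical, the fitted
counts for attempts 1 and c+1 are taken to equal the observed counts
(which is what the estimates above imply), and the degrees of freedom
(cells minus one, minus two free parameters) are our own reckoning.

```python
def fit_misinformation(counts, c):
    """counts[i-1] = number of examinees correct on attempt i.
    Misinformed examinees are assumed to answer correctly on attempt
    c + 1. Returns (zeta, nu1, nu2, chi_square)."""
    t, n = len(counts), sum(counts)
    special = {0, c}  # zero-based positions of attempts 1 and c + 1
    others = [x for i, x in enumerate(counts) if i not in special]
    guess = sum(others) / len(others)  # estimate of N * nu1 / t
    zeta = (counts[0] - guess) / n
    nu2 = (counts[c] - guess) / n
    nu1 = t * guess / n
    # Fitted counts: observed for the two special cells, pooled otherwise.
    expected = [counts[i] if i in special else guess for i in range(t)]
    chi_square = sum((x - e) ** 2 / e for x, e in zip(counts, expected))
    return zeta, nu1, nu2, chi_square


# Hypothetical counts for a five-alternative item with c = 3.
zeta, nu1, nu2, chi_square = fit_misinformation([50, 8, 12, 20, 10], c=3)
degrees_of_freedom = (5 - 1) - 2  # assumption: two free parameters
```

The statistic would then be compared with a chi-square percentile for
the stated degrees of freedom.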



More generally, for arbitrary c,

    $p_1 = \zeta + \nu_1/t$,                                      (2.9)

    $p_{c+1} = \nu_2 + \nu_1/t$,                                  (2.10)

and

    $p_i = \nu_1/t$,  $i \ne 1, c+1$.                             (2.11)

Slight generalizations of the model may be possible. Suppose, for
example, c=3 and t=5, as in equations (2.4)-(2.8), but suppose that among
examinees with misinformation only some proportion choose the correct
response once the c=3 alternatives are eliminated. Then $p_4$ and $p_5$
take a more general form involving this additional proportion.
Now, however, a goodness-of-fit test is no longer possible because there
are zero degrees of freedom.

Equivalent and Hierarchically Related Items, and Related
Latent Structure Models

In recent years, several investigators have proposed models
based on the notion of equivalent or hierarchically related items. Two
items are said to be equivalent if examinees know both or neither one.
If in addition, there are examinees who know the first but not the second,
the items are hierarchically related. As argued by Molenaar (1981),
clearly there are situations where it may be difficult or impossible to
generate equivalent items. However, experience suggests that there are
situations where one of these assumptions might be reasonable (e.g.,
Macready and Dayton, 1977; Harris and Pearlman, 1978; Harris et al., 1980).

It should be mentioned that in some instances a test consisting of
hierarchically related items is considered to be desirable and the goal
is to measure the extent to which a test has this property. Put another
way, the goal is to determine the extent to which the items on a test
form a Guttman scale. One such measure was proposed by Cliff (1977).
(See also Harnisch and Linn, 1981, and the paper by McArthur in this
volume.)

The simplest model consists of two equivalent items, and it arises
as follows. Let $\zeta$ be the proportion of examinees who know both items.
In contrast to earlier sections, a conventional scoring procedure is used.
That is, examinees get only one attempt at an item, and the item is scored
either correct or incorrect. Let $p_{ij}$ be the probability of the response
pattern ij (i=0,1; j=0,1) where a 0 means incorrect, and a 1 means correct.
Thus, $p_{10}$ represents the probability of a correct-incorrect response
for a randomly sampled examinee. If $\beta_1$ is the probability of
correctly guessing the response to the first item when the randomly
sampled examinee does not know, and if $\beta_2$ is the corresponding
probability on the second item, and if local independence holds (i.e.,
given an examinee's latent state, the responses are independent), then

    $p_{11} = \zeta + (1-\zeta)\beta_1\beta_2$,

    $p_{10} = (1-\zeta)\beta_1(1-\beta_2)$,

    $p_{01} = (1-\zeta)(1-\beta_1)\beta_2$,

    $p_{00} = (1-\zeta)(1-\beta_1)(1-\beta_2)$.

Solving for $\beta_1$, $\beta_2$, and $\zeta$ yields

    $\beta_1 = p_{10}/(p_{10} + p_{00})$,

    $\beta_2 = p_{01}/(p_{01} + p_{00})$,

and

    $\zeta = 1 - (p_{01} + p_{00})(p_{10} + p_{00})/p_{00}$.

If $x_{ij}$ is the number of examinees who have an ij response pattern,
the unbiased maximum likelihood estimate of $p_{ij}$ is
$\hat{p}_{ij} = x_{ij}/N$ where, as before, N is the number of randomly
sampled examinees. Thus, $\zeta$ can be estimated.
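
The closed-form solution can be illustrated with a short sketch that
plugs the sample proportions into the expressions for $\beta_1$,
$\beta_2$, and $\zeta$. The function name and the counts are
hypothetical; the counts were chosen so that the generating values are
recovered exactly.

```python
def equivalent_item_estimates(x11, x10, x01, x00):
    """Counts of response patterns ij for two items assumed equivalent
    (1 = correct, 0 = incorrect). Returns (zeta, beta1, beta2)."""
    n = x11 + x10 + x01 + x00
    p10, p01, p00 = x10 / n, x01 / n, x00 / n
    beta1 = p10 / (p10 + p00)  # guessing probability, first item
    beta2 = p01 / (p01 + p00)  # guessing probability, second item
    zeta = 1 - (p01 + p00) * (p10 + p00) / p00
    return zeta, beta1, beta2


# Hypothetical counts consistent with zeta = .5, beta1 = .25, beta2 = .2
# and N = 1000 examinees.
zeta_hat, beta1_hat, beta2_hat = equivalent_item_estimates(525, 100, 75, 300)
```

The estimate is undefined when no examinee misses both items
($x_{00} = 0$), a case the sketch does not guard against.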

An interesting feature of the equivalent item model is that it is
possible to include additional errors at the item level such as
Pr(incorrect | examinee knows) (Macready and Dayton, 1977). However,
estimating the parameters usually requires iterative procedures that are
typically implemented on a computer. Goodman (1979) describes one such
procedure, and Macready and Dayton (1977) used the scoring method (cf.
Kale, 1962).

Testing Whether Two Items are Equivalent 

One way to check the assumption of equivalent items is to apply the
usual goodness-of-fit test as illustrated by Macready and Dayton (1977).
For some cases, such as the equivalent item model described above, this
cannot be done because there are zero degrees of freedom.

An alternative and relatively simple test of whether two items are
equivalent is possible using an answer-until-correct scoring procedure.
For a randomly sampled examinee let $p_{ij}$ be the probability of a correct
response on the ith attempt of the first item and the jth attempt of the
second. If two items are indeed equivalent, and if, for example, t=3, it
can be seen that

    $p_{12} = p_{21} = p_{22}$,

    $p_{13} = p_{23}$,

and $p_{31} = p_{32}$.

For recent results on testing these equalities, see Smith et al. (1979),
and Wilcox (1982e).

Hartke (1978) describes another approach based on latent partition
analysis, and an index proposed by Baker and Hubert (1977) might also be
useful.

Hierarchically Related Items 

Dayton and Macready (1976, 1980) describe very general latent structure
models for handling hierarchically related items. Again these models can
be used to measure guessing, and they have the advantage of including
other errors at the item level such as Pr(incorrect | examinee knows).
The model for AUC tests essentially sets this probability to 0, but the
practical implications of this have not been established.

As was the case for equivalent items, estimating the parameters in
the model requires iterative techniques. In some instances simple (closed
form) estimates exist (e.g., Wilcox, 1980b), but these models make certain
assumptions that may be unreasonable in many situations.
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3. STRENGTHS AND WEAKNESSES OF LATENT CLASS MODELS 

Latent class models have three primary strengths. First, it now 
appears that one of two models can be used to explain the observed re- 
sponses to a multiple-choice test item (Wilcox, 1982b). These models 
are an oversimplification of reality (as are all models), but they seem 
to give a good approximation of how examinees behave when taking a 
test. Of course future investigations might reveal that more complex 
models are really needed, but so far this does not appear to be the 
case. 

The second strength is that many measurement problems can now be 
solved that were previously impossible to address. In particular, these 
models correct for guessing, or measure the effects of guessing which 
in turn improves the accuracy of tests and measurement techniques. 
Note that the nature of guessing in latent class models is different
from the guessing parameter in latent trait models (Wilcox, 1982c).

Third, even if some other model is ultimately preferred, a latent
class model may be useful, for example, when estimating the item
parameters in a latent trait model.

A weakness of latent class models is that certain technical
problems still need to be solved. These include better ways of scoring an
n-item test, testing the model used in Wilcox (1982e), and finding a
strong true-score model that is reasonable when the model in Wilcox
(1982a) gives a poor fit to data. Also, some examinees may give an
incorrect response when they know, but the seriousness of this problem 
is not well understood. 






4. PRESENT AREAS OF APPLICATION

This section outlines some of the measurement problems that can now 
be solved with latent class models. 

The Accuracy of an Item and the Effectiveness of Distractors 

In addition to estimating the proportion of examinees who know the
item, the latent structure models for AUC tests can be used to
estimate the probability of correctly determining whether a typical examinee
knows the item. More specifically, assume it is decided that an examinee
knows the correct response if the correct answer is given on the first
attempt (i.e., a conventional scoring procedure is used). For a randomly
sampled examinee, the probability of correctly determining whether he/she
knows is just $\tau = 1 - p_2$ (Wilcox, 1981a), and this is estimated with
$\hat{\tau} = 1 - x_2/N$. Note that when (2.2) is assumed,
$0 \le p_2 \le 1/2$, in which case $1/2 \le \tau \le 1$.

The parameter $\tau$ is a function of two important quantities. The first
is the proportion of examinees who know the answer, i.e., $\zeta_{t-1}$,
and the second is the effectiveness of the distractors among the examinees
who do not know. To see this more clearly, note that

    $\tau = \zeta_{t-1} + (1 - p_1)$.                             (4.1)

When $\zeta_{t-1}$ is close to one the item accurately reflects the true
latent state of the examinees because presumably examinees who know will
choose the correct response on their first attempt. As $\zeta_{t-1}$ moves
closer to zero, the accuracy depends more on the effectiveness of the
distractors. Thus, it may be important to determine how well distractors
are performing among the examinees who do not know.

It can be shown that the distractors are most effective when
guessing is at random, which corresponds to

    $p_2 = p_3 = \cdots = p_t$                                    (4.2)

(Wilcox, 1981a). This suggests that (4.2) be tested, and/or that we
estimate how "far away" the $p_i$ values are from the ideal case where
(4.2) holds.

Testing (4.2) can be accomplished by noting that the conditional
distribution of $x_2, \ldots, x_t$ given $x_1$ is multinomial with
parameters $N - x_1$ and $p_i/(1-p_1)$, $i = 2, \ldots, t$. Thus, the usual
chi-square test can be applied. That is, compute

    $X^2 = \sum_{i=2}^{t} \frac{(x_i - (N-x_1)/(t-1))^2}{(N-x_1)/(t-1)}$.   (4.3)

If $X^2$ is greater than or equal to the $100(1-\alpha)$ percentile of the
chi-square distribution with t-2 degrees of freedom, reject the hypothesis
that (4.2) holds. For recent results on using (4.3), see Chacko (1965),
Smith et al. (1979), and Wilcox (1982e).

Empirical results indicate that guessing will not be at random. Thus,
a more interesting question might be to determine whether the distractors
are "close" to the ideal situation where (4.2) holds. The first step in
solving this problem is to choose a measure of how unequal the $p_i$ values
are ($i = 2, \ldots, t$). Many such measures have been proposed which have
similar properties (e.g., Marshall and Olkin, 1979; Bowman et al., 1971).
One of these is the entropy function which was used by Wilcox (1982a), and
another is Simpson's measure of diversity (Simpson, 1949) given by

    $\sum_{i=2}^{t} [p_i/(1-p_1)]^2$.

Writing (4.3) as

    $X^2 = (t-1)(N-x_1) \sum_{i=2}^{t} (x_i/(N-x_1))^2 - (N-x_1)$,

it is seen that the usual maximum likelihood estimate of Simpson's measure
of diversity, namely, $\sum_{i=2}^{t} (x_i/(N-x_1))^2$, is a simple linear
transformation of $X^2$. Since $X^2$ is better known than Simpson's measure
of diversity, $X^2$ will be used here.

It is helpful to note that the smallest possible value for $X^2$ is

    $L = \frac{t-1}{N-x_1} [(N-x_1)(2r+1) - (t-1)r(r+1)] - (N-x_1)$,   (4.4)

where r is the largest integer satisfying $r(t-1) \le N-x_1$ (Dahiya, 1971).
The maximum value is

    $M = (N-x_1)(t-2)$                                            (4.5)

(Smith et al., 1979). The closer $X^2$ is to M, the less effective are the
distractors. Since L and M are known, the relative extent to which $X^2$
is close to M can be determined. In particular,

    $E = (X^2 - L)/(M - L)$

measures the effectiveness of the distractors being used, where
$0 \le E \le 1$. If E=0, the distractors are as effective as possible in
determining whether an examinee knows the correct response. As E
approaches 1, the distractors become less effective.
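
A minimal sketch of the computations in (4.3) through (4.5) and the
index E follows. The function name and the counts are hypothetical, and
no significance test is performed; the statistic would still have to be
compared with a chi-square percentile on t-2 degrees of freedom.

```python
def distractor_effectiveness(counts):
    """counts[i-1] = number of examinees correct on attempt i.
    Returns (X2, L, M, E) for the examinees who miss on attempt 1.
    Assumes t >= 3 and at least one first-attempt miss."""
    t = len(counts)
    m = sum(counts) - counts[0]  # N - x_1
    k = t - 1                    # number of later attempts
    expected = m / k
    x2 = sum((x - expected) ** 2 / expected for x in counts[1:])
    r = m // k                   # largest integer with r(t-1) <= N - x_1
    low = (k / m) * (m * (2 * r + 1) - k * r * (r + 1)) - m   # (4.4)
    high = m * (t - 2)                                        # (4.5)
    return x2, low, high, (x2 - low) / (high - low)


# Hypothetical counts for a four-alternative item, N = 100.
x2, low, high, e_index = distractor_effectiveness([60, 20, 12, 8])
```

For these illustrative counts E is close to zero, which under the
interpretation above would indicate distractors that are working nearly
as well as possible.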



Comparing Two Items 

If the AUC model is assumed, and if independent estimates of the $p_i$
values for two items are available, it is possible to test the hypothesis
that one of the items is at least as effective as the second by applying
results in Robertson and Wright (1981). The null hypothesis of interest
here is that

    $\sum_{i=2}^{k} p_i/(1-p_1) \ge \sum_{i=2}^{k} p_i'/(1-p_1')$,  $k = 2, \ldots, t-2$,

where $p_i'$ is the $p_i$ value for the second item. Let $\tau_1$ and
$\tau_2$ be the values of $\tau$ for the two items. Another way of comparing
two items is to test whether the first item is better than the second by
testing whether $\tau_1 > \tau_2$. In effect this approach compares the
overall effectiveness of the two items in terms of the population of
examinees, while the approach previously described compares the
effectiveness of the distractors among the examinees who do not know.

Characterizing Tests 
Let $\tau_i$ be the value of $\tau$ for the ith item on an n-item test. A
natural way of describing the accuracy of a test is to use
$\tau_s = \sum_{i=1}^{n} \tau_i$, which is the expected number of correct
decisions about whether a typical (randomly sampled) examinee knows the
answer to the items on a test. If, for example, $\tau_s = 7$ and n = 10,
then on the average, 7 correct decisions would be made about whether an
examinee knows the answer to an item, but for 3 of the items it would be
decided that the examinee knows when in fact he/she does not.

Estimating $\tau_s$ is easily accomplished using previous results. In
particular, for a random sample of N examinees, let $x_{ij} = 0$ if the jth
examinee gets the ith item correct on the second attempt; otherwise
$x_{ij} = 1$. Then

    $\hat{\tau}_s = N^{-1} \sum_{i=1}^{n} \sum_{j=1}^{N} x_{ij}$

is an unbiased estimate of $\tau_s$.
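
The estimate of $\tau_s$ amounts to a simple count, sketched below. The
data layout (one row per examinee, one flag per item) and the values are
hypothetical choices made for illustration.

```python
def estimate_tau_s(second_attempt_correct):
    """Rows are examinees, columns are items; an entry is True when the
    examinee answered that item correctly on the second attempt.
    x_ij = 0 for such entries and 1 otherwise."""
    n_examinees = len(second_attempt_correct)
    total = sum(
        1 - int(flag) for row in second_attempt_correct for flag in row
    )
    return total / n_examinees


# Hypothetical responses: 4 examinees, 3 items.
responses = [
    [False, True, False],
    [False, False, False],
    [True, True, False],
    [False, False, False],
]
tau_s_hat = estimate_tau_s(responses)
```

Here the estimated expected number of correct decisions is 2.25 out of
3 items.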
The k Out of n Reliability of a Test

Once test data are available, the question arises as to how certain
we can be that $\tau_s$ is large or small. That is, we want to estimate the
probability that the number of correct decisions is at least some value
$\tau_0$ (cf. Tong, 1978). This problem is similar to one found in
the engineering literature where the goal is to estimate the k out of n
reliability of a system. Bounds on this probability can be estimated
without assuming anything about $\mathrm{cov}(x_{ij}, x_{i'j'})$ (Wilcox,
1982e). The procedure is outlined below.

Let $z_i = 1$ if a correct decision is made about whether a randomly
sampled examinee knows the ith item on a test; otherwise $z_i = 0$. For a
randomly sampled examinee $\Pr(z_i = 1) = \tau_i$. Note that from previous
results $\Pr(z_i = 1) = \Pr(x_{ij} = 1)$. The k out of n reliability of a
test is defined to be

    $\rho_k = \Pr(\sum_{i=1}^{n} z_i \ge k)$.

This is the probability that, for a typical examinee, at least k correct
decisions are made among the n items on a test. By a correct decision
is meant the event of correctly determining whether the examinee knows
an item. Knowing $\rho_k$ yields additional and important information about
the accuracy of a test. An estimate of $\rho_k$ is not available unless
$\mathrm{cov}(z_i, z_j) = 0$, or the number of items is small. (See Wilcox,
1982g, 1982j.)

For any two items, let $p_{km}$ be the probability that a randomly
selected examinee chooses the correct response on the kth attempt of the
first item, and the mth attempt of the second. (It is assumed that both
items are administered according to an AUC scoring procedure.) Let
$\zeta_{ij}$ ($i = 0, \ldots, t-1$; $j = 0, \ldots, t-1$) be the proportion
of examinees who can eliminate i distractors on the first item and j
distractors on the second. Then, under certain mild independence
assumptions,

    $p_{km} = \sum_{i=0}^{t-k} \sum_{j=0}^{t-m} \zeta_{ij}/[(t-i)(t-j)]$.

The equation makes it possible to express the $\zeta_{ij}$'s in terms of
the $p_{km}$'s, which in turn makes it possible to estimate $\zeta_{ij}$
for any i and j.

Next let $\epsilon$ be the probability that for both items, a correct
decision is made about an examinee's latent state. It can be seen that

' ^ = ^t-l,t-l ^"Pll 
and so $\epsilon$ can also be estimated.

For the i~ and j— item on a test, let e-j be the value of e, 

and define \ 

n- 1 n 

i=l j=i+l ^ 

Uk=^s-.^ . 
where was previously defined to be Ex^ and 

= (2S - K(K-l)/2). 
Then from Sathe et al. (1980)

If 2V^_^ < (n+K-2)U^_^ 

p^ > 2((K*-1)Uk_i - Vk-1 - 
(K*-K)(K*-K+1) 
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where K* + K - 3 is the largest integer in 2V|^_^/U|^_j . Two upper bounds 
are also available. The first is ~' 
Pj, < 1 + ((n+K-l)U^ - 2V^)/Kn 

and the second is that if ZV^ < (K-l)U|^i 

< 1 - 2 ^^*-^^"k - h 

(K-K*)(K-K*+1) • - 

where K* + K - 1 is the largest integer in 2V|^/U|^. 

What these results mean is that we can estimate quantities that
indicate whether $\rho_k$ is large or small. For example, suppose the right
side of the third to last inequality is estimated to be .9, and that
$2V_{K-1} \le (n+K-2)U_{K-1}$. This does not yield an exact estimate of
$\rho_k$, but it does say that $\rho_k$ is estimated to be at least .9.
Thus, this would indicate that the overall test is fairly accurate. If,
for example, the above inequalities indicate that $\rho_k \le .95$ and
$\rho_k \ge .1$, this does not give very useful information about whether
$\rho_k$ is reasonably large. If $\rho_k \le .1$ we have a poor test.
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Estimating the Proportion of Items an 
Examinee Knows 

It is a simple matter to extend previous results to situations when
a single examinee responds to items randomly sampled from some item domain.
For example, let $q_i$ be the probability of a correct response on the ith
attempt of a randomly sampled item. Let $\gamma_i$ ($i = 0, \ldots, t-1$)
be the proportion of items for which the examinee can eliminate i
distractors. It is assumed that each item has at least one effective
distractor, so $\gamma_{t-1}$ is the proportion of items the examinee
knows. It follows that

    $q_i = \sum_{j=0}^{t-i} \gamma_j/(t-j)$,

which is the same as equation (2.0) where $p_i$ and $\zeta_i$ are replaced
with $q_i$ and $\gamma_i$. In fact, all previous results extend immediately
to the present case.

Criterion-Referenced Tests 

A common goal of a criterion-referenced test is to sort examinees
into two categories. (See Hambleton et al., 1978a; Berk, 1980; and the
1980 special issue of Applied Psychological Measurement.) Frequently
these categories are defined in terms of some true score, and here the
true score of interest is $\gamma_{t-1}$, the proportion of items in an
item domain that an examinee knows. The goal is to determine whether
$\gamma_{t-1}$ is larger or smaller than some predetermined constant, say
$\gamma'$.

It is known that guessing can seriously affect the accuracy of a
criterion-referenced test (van den Brink and Koele, 1980). Moreover,
assuming random guessing can be highly unsatisfactory (Wilcox, 1980c).
Another advantage of the AUC scoring model is that it substantially
reduces this problem (Wilcox, in press, b). For some results on comparing
$\gamma_{t-1}$ to $\gamma'$ when equivalent items are available, see Wilcox
(1980a).

Sequential and Computerized Testing 

In certain situations, such as in computerized testing, sequential 
procedures will be convenient to use. Some progress has been made in 
this area, but much remains to be done. 

Suppose an examinee responds to items randomly sampled from an item
domain and presented on a computer terminal. Further suppose the examinee
responds according to an AUC scoring procedure. A typical sequential
procedure for this situation is to continue sampling until there are n
items for which the examinee gives a correct response on the first attempt.
Let $y_i$ ($i = 1, \ldots, t$) be the number of items for which the
examinee requires i attempts to get the correct response. For the
sequential procedure just described, sampling continues until $y_1 = n$,
in which case the joint probability function of $y_2, \ldots, y_t$ is
negative multinomial given by

    $f(y_2, \ldots, y_t \mid q_1, \ldots, q_t) = n\Gamma(y_0) \prod_{i=1}^{t} q_i^{y_i}/y_i!$,

where $y_0 = \sum_{i=1}^{t} y_i$ and $y_i = 0, 1, \ldots$ ($i = 2, \ldots, t$).

The problem with the sequential procedure just described is that with
positive probability, the number of sampled items will be too large for
practical purposes. This might be an extremely rare event, but it is
desirable to avoid this possibility altogether. A solution to this
problem is to use a closed sequential procedure where sampling continues
until $y_1 = n_1$, or $y_2 = n_2$, etc., where $n_1, \ldots, n_t$ are
positive integers chosen by the investigator. In this case the joint
probability function of



where I is the usual indicator function given by 

1-^1 ""i-" 0, if otherwise 
For the special case $n_1 = n_2 = \cdots = n_t = n$, the probability
function becomes

    $n\Gamma(y_0) \prod_{i=1}^{t} q_i^{y_i}/y_i!$,

which has the same form as the negative multinomial except that for some
j, $y_j = n$, and $0 \le y_i \le n-1$, $i \ne j$.

The maximum likelihood estimate of $q_i$ is $\hat{q}_i = y_i/y_0$, so the
maximum likelihood estimate of the proportion of items an examinee knows is
$\hat{\gamma}_{t-1} = \hat{q}_1 - \hat{q}_2$ (Zehna, 1966). If the model is
assumed to hold, $\hat{\gamma}_{t-1}$ may not be a maximum likelihood
estimate. Instead, one would estimate $\gamma_{t-1}$ to be zero when
$\hat{\gamma}_{t-1} < 0$; if the estimates of $q_i$ ($i = 1, \ldots, t$) do
not satisfy the inequality $q_1 \ge q_2 \ge \cdots \ge q_t$, apply the
pool-adjacent-violators algorithm (Barlow et al., 1972).

Wilcox (in press) shows that if the goal is to compare $\gamma_{t-1}$ to
the known constant $\gamma'$, as in criterion-referenced testing, and if
$\gamma_{t-1} \ge \gamma'$ is decided if and only if
$\hat{\gamma}_{t-1} \ge \gamma'$, the sequential and closed sequential
procedures have the same level of accuracy. Moreover, it appears that
the closed sequential procedure nearly always improves upon the more
conventional fixed sample approach. More recently Wilcox (1982f)
proposed two tests of $q_1 = \cdots = q_t$, and methods of determining the
moments of the distribution were also described.
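
To make the closed sequential procedure concrete, the sketch below
simulates it for one examinee and computes the truncated estimate of the
proportion of items known. The function names, the attempt probabilities,
and the seed are hypothetical, and the pooling step for order violations
beyond the first two attempts is omitted for brevity.

```python
import random


def closed_sequential(q, n, rng):
    """Administer randomly sampled items until, for some i, n items have
    required exactly i attempts. q[i-1] = Pr(correct on attempt i).
    Returns the counts y_1, ..., y_t."""
    y = [0] * len(q)
    while max(y) < n:
        i = rng.choices(range(len(q)), weights=q)[0]
        y[i] += 1
    return y


def estimate_proportion_known(y):
    """q_i is estimated by y_i / y_0; the estimate q_1 - q_2 is set to
    zero when it would otherwise be negative."""
    y0 = sum(y)
    return max(0.0, y[0] / y0 - y[1] / y0)


rng = random.Random(0)  # fixed seed only so the run is repeatable
y = closed_sequential([0.6, 0.25, 0.15], n=5, rng=rng)
gamma_hat = estimate_proportion_known(y)
```

Because sampling stops as soon as any count reaches n, no more than
t(n-1)+1 items are ever administered, which is the practical advantage
of the closed procedure noted above.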



A Strong True Score Model

Strong true score models attempt to relate a population of examinees
to a domain of items. In many situations an item domain does not exist
de facto, in which case strong true score models attempt to find a family
of probability functions for describing the observed test scores of any
examinee, and simultaneously to find a distribution that can be used to
describe the examinees' true scores.

Perhaps the best known model is the beta-binomial. If y is the number
of correct responses from an examinee taking an n-item test, it is assumed
that for a specific examinee, the probability function of y is

    $f(y \mid q) = \binom{n}{y} q^y (1-q)^{n-y}$.

For the population of examinees, it is assumed that the distribution of
q is given by

    $g(q) = \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)} q^{r-1} (1-q)^{s-1}$,
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where r > 0 and s > 0 are unknown parameters that are estimated with ob- 
served test scores. Apparently Keats (1951) was the first to consider 
this model in mental test theory. 
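
Integrating the binomial probability function over the beta distribution
gives the marginal distribution of observed scores,
$\binom{n}{y} B(y+r, n-y+s)/B(r,s)$, where B is the beta function. The
sketch below evaluates it; the function names and parameter values are
hypothetical, and estimation of r and s from data is not shown.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(y, n, r, s):
    """Probability of y correct responses on an n-item test when the
    examinee's q follows a beta distribution with parameters r and s."""
    return math.comb(n, y) * math.exp(
        log_beta(y + r, n - y + s) - log_beta(r, s)
    )


# Hypothetical parameter values for a 10-item test.
n_items, r, s = 10, 2.0, 3.0
pmf = [beta_binomial_pmf(y, n_items, r, s) for y in range(n_items + 1)]
mean_score = sum(y * p for y, p in enumerate(pmf))  # equals n*r/(r+s)
```

Working on the log scale keeps the gamma functions from overflowing for
longer tests.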

The beta-binomial model has certain theoretical disadvantages, but
experience suggests that it frequently gives good results with real data.
A review of these results is given by Wilcox (1981d). However, the model
does not always give a good fit to data, and some caution should be
exercised (Keats, 1954). In the event of a poor fit, a gamma-Poisson model
might be considered (Wilcox, 1981d).

When the beta-binomial is assumed, many measurement problems can be
solved. These include equating tests by the equipercentile method,
estimating the frequency of observed scores when a test is lengthened, and
estimating the effects of selecting individuals on a fallible measure
(Lord, 1955). Other applications include estimating the reliability of
a criterion-referenced test (Huynh, 1976a), estimating the accuracy of
a criterion-referenced test (Wilcox, 1977c), and determining passing
scores (Huynh, 1976b).

A problem with the beta-binomial model is that it ignores guessing.
Attempts to remedy this problem are summarized by Wilcox (1981d), but
all of these solutions now appear to be unsatisfactory in most situations.
This is unfortunate because it means that a slightly more complex model
must be used. More recently, however, Wilcox (1982a, 1982b) proposed
a generalization of the beta-binomial model that takes guessing into
account, and which gives a reasonably good fit to data.
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Some Miscellaneous Applications of Latent Structure Models 

Several applications of latent structure models have already been
described, and there are several other situations where they may be
useful. For example, Ashler derives an expression for the biserial
correlation coefficient that includes the proportion of examinees who
know an item. Wilcox (1982g) discusses how to empirically determine the
number of distractors needed on a multiple-choice test item, and Knapp
(1977) discusses a reliability coefficient based on the latent state
point of view. (See also Frary, 1969.) Macready and Dayton (1977)
illustrate how the models can be used to determine the number of
equivalent items needed for measuring an instructional objective, and
Emrick (1971) shows how the models might be used to determine passing
scores. Note that Emrick's estimation procedure is incorrect (Wilcox and
Harris, 1977), but this is easily remedied using the estimation procedures
already mentioned; closed form estimates are given by van der Linden
(1981).

5. POSSIBLE EXTENSIONS AND CONTROVERSIAL ISSUES 

The AUC models assume that examinees eliminate as many distractors
as they can and then guess at random from among the alternatives that
remain. A recent empirical investigation suggests that the random
guessing portion of this assumption will usually give a reasonable
approximation of reality (Wilcox, 1982k). No doubt there will be cases
where this assumption is untenable, in which case there are no guidelines
on how to proceed.

A theoretical advantage of the latent structure models based on
equivalent or hierarchically related items is that they include not
only guessing, but errors such as Pr(incorrect response | examinee knows).
The practical implications of this are not well understood.

Wilcox (1981a) mentions that under an item sampling model for AUC
tests, an examinee with partial information can improve his/her test
score by choosing a response and, if it is incorrect, deliberately
choosing another incorrect response. Thus, if $(y_1 - y_2)/n$ is used to
estimate $\gamma_{t-1}$, the estimate would be higher for such an examinee
because $y_2$ is lower. Four points should be made. First, this problem
can be partially corrected by estimating the $q_i$'s with the
pool-adjacent-violators algorithm (Barlow et al., 1972, pp. 13-15). Second,
if an examinee is acting as described, it is still possible to correct for
guessing by applying the true score model proposed by Wilcox (1982a).
If it gives a good fit to data, estimate to be q -(l-q]^)5(q]L)- 

The third point is that there is no indication of how serious this
problem might be. Finally, a new scoring procedure is being examined that
might eliminate the problem.

It has been argued (e.g., Messick, 1975) that tests should be
homogeneous in some sense. Frequently this means that at a minimum, a test
should have a single factor. A sufficient condition for the best known
latent trait models (see e.g., Lord, 1980; Wainer et al., 1980; Hambleton
et al., 1978b; Choppin, this volume) is that this assumption be met
(cf. McDonald, 1981). In general, the latent structure models described
in this paper do not require this assumption. One exception is the
equivalent item model. (See Harris and Pearlman, 1978.) The point is that
in this paper, no stand on this issue is needed, i.e., it is irrelevant
whether a test is homogeneous when applying, say, the
answer-until-correct scoring procedure, or the corresponding strong
true-score model.

Wainer and Wright (1980) and Mislevy and Bock (1982) have studied
the effects of guessing on latent trait models, but these investigations
do not take into account the results and type of guessing described
here. If guessing proves to be a problem, perhaps latent class models
can be of use when latent trait models are applied.

GENERALIZABILITY THEORY

Noreen Webb
University of California, Los Angeles

Definition and Focus

Generalizability theory evolved out of the recognition that the
concept of undifferentiated error in classical test theory provided too
gross a characterization of the multiple sources of error in a
measurement. The multidimensional nature of measurement error can be
seen in how a test score is obtained. For example, one of many
possible test forms might be administered on one of many possible
occasions by one of many possible testers. Each of these
choices (test form, occasion, and tester) is a potential source of
error. G-theory attempts to assess each source of error in order to
characterize the measurement and improve its design.

A behavioral measurement, then, is a sample from a universe of
admissible observations characterized by one or more facets (e.g.,
test forms, occasions, testers). This universe is usually defined by
the Cartesian product of the levels (called conditions in G-theory) of
the facets. From this perspective, Cronbach et al. (1972, p. 15) say:

    The score on which the decision is to be based is only one
    of many scores that might serve the same purpose. The
    decision maker is almost never interested in the response
    given to the particular stimulus objects or questions, to
    the particular tester at the particular moment of testing.
    Some, at least, of these conditions of measurement could be
    altered without making the score any less acceptable to the
    decision maker. That is to say, there is a universe of
    observations, any of which would have yielded a usable basis
    for the decision. The ideal datum on which to base the
    decision would be something like the person's mean score
    over all acceptable observations, which we shall call his
    "universe score." The investigator uses the observed score
    or some function of it as if it were the universe score.
    That is, he generalizes from sample to universe. The
    question of "reliability" thus resolves into a question of
    accuracy of generalization, or generalizability.

1 Introductions to G-theory are provided by Brennan (1977a, 1979a),
Brennan and Kane (1980), Cronbach et al. (1972), Erlich and
Shavelson (1976b), Gillmore (1979), Cardinet and Tourneur (1978),
Huysamen (1980), Shavelson and Webb (1981), Tourneur (1978),
Tourneur and Cardinet (1977), Van der Kamp (1976), and Wiggins
(1973).

Since different measurements may represent different universes,
G-theory speaks of universe scores rather than true scores,
acknowledging that there are different universes to which decision
makers may generalize. Likewise, the theory speaks of
generalizability coefficients rather than the reliability coefficient,
realizing that the value of the coefficient may change as definitions
of universes change.

G-theory distinguishes a decision (D) study from a
generalizability (G) study. This distinction recognizes that certain
studies are associated with the development of a measurement procedure
(G studies) while other studies then apply the procedure (D studies).
Although the decision-maker must begin to plan the D study before
conducting the G study, the results of the G study will guide the
specification of the D study. In planning the D study, the decision
maker (a) defines the universe of generalization and (b) specifies his
proposed interpretation of a measurement. These plans determine (c)
the questions to be asked of the G study data in order to optimize the
measurement design. Each of these points is considered in turn.

(a) G-theory recognizes that the universe of admissible 
observations encompassed by a G study may be broader than the universe 
to which a decision maker wishes to generalize. That is, the decision 
maker proposes to generalize to a universe comprised of some subset of 
the facets in the G study. This universe is called the universe of
generalization. It may be defined by reducing the universe of
admissible observations, i.e., by reducing the levels of a facet
(creating a fixed facet; cf. fixed factor in ANOVA), by selecting and
thereby controlling one level of a facet, or by ignoring a facet. All
three alternatives have consequences for the estimation of the 
components of error variance that enter into the observed score 
variance. 

(b) G-theory recognizes that decision makers use the same test
score in different ways. For example, some interpretations may focus
on individual differences (i.e., relative or comparative decisions),
some may use the observed score as an estimate of a person's universe
score (absolute decisions; cf. criterion-referenced interpretations),
while still others may use the observed score in a regression estimate
of the universe score (cf. Kelley's, 1947, regression estimate of true
scores). There is a different error associated with each of these
proposed interpretations.

To illustrate the distinction between relative and absolute 
decisions, suppose that a decision is to be made using scores on an 
objective test of arithmetic. As an example of a relative decision, a 
decision-maker might want to channel the top 20 percent of the scorers 
into an above-average academic track (regardless of their actual 
scores). In this case, if all items on the test rank students in the 
same way, even if some items are more difficult than others, it would 
not matter to a student which items he or she received. The same
students would be selected for the accelerated track whether the test
consists of easy items or difficult items. In more formal terms, the
variation in item means would not be a part of error. As an example
of an absolute decision, a decision-maker might want to select for 
accelerated placement all students who answer correctly 75 percent or 
more of the items on the test. In this case, the variation in item 
means would contribute to error. Even if all items rank students in 
the same way, a test composed of easy items would place more students 
into the accelerated program than a test composed of difficult items. 
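
The contrast between the two kinds of decision can be sketched in a
few lines of Python. The scores below are hypothetical (they do not
come from this report), and the helper names `top_fraction` and
`at_or_above` are illustrative only. Both test forms rank the five
students identically, so the relative decision is unchanged, while the
absolute decision depends on the difficulty of the form.

```python
def top_fraction(scores, fraction):
    """Return the ids of the students in the top `fraction` by score."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def at_or_above(scores, cutoff):
    """Return the ids of the students whose score meets the cutoff."""
    return {student for student, value in scores.items() if value >= cutoff}


# Hypothetical proportion-correct scores on an easy and a hard form.
# The two forms order the students in exactly the same way.
easy = {"A": 0.95, "B": 0.90, "C": 0.80, "D": 0.70, "E": 0.60}
hard = {"A": 0.80, "B": 0.75, "C": 0.65, "D": 0.55, "E": 0.45}

# Relative decision: top 20 percent, regardless of actual score.
relative_easy = top_fraction(easy, 0.20)
relative_hard = top_fraction(hard, 0.20)

# Absolute decision: everyone at or above 75 percent correct.
absolute_easy = at_or_above(easy, 0.75)
absolute_hard = at_or_above(hard, 0.75)
```

In this toy case the same student is selected under the relative rule
on either form, whereas the easy form admits one more student than the
hard form under the absolute rule.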

(c) Ordinarily, the universe of admissible observations in a G 
study is defined as broadly as possible within practical and 
theoretical constraints. In most cases Cronbach et al. recommend
using a crossed G study design so that all sources of error and
interactions among sources of error can be estimated. (It should be
noted, however, that a nested G study is sometimes useful because it
provides more degrees of freedom for some estimates of sources of
error.) The design of D studies, on the other hand, can vary widely
and include crossed, partially nested, and completely nested designs.
Often, in D studies, nested designs are used for convenience, to
reduce costs, to increase sample size, or for a combination of these
reasons. All facets in the D study design may be random or only some
may be random. 
Development of the Model

Scores and variance components. In G-theory a person's score is
decomposed into a component for the universe score ($\mu_p$) and one or
more error components. To illustrate this decomposition, we consider
the simplest case for pedagogical purposes: a one-facet, p x i (person
by, say, item) design. (The object of measurement, here persons, is
not a source of error and, therefore, is not a facet.) The
presentation readily generalizes to more complex designs. In the p x
i design with generalization over all admissible items taken from an
indefinitely large universe, the score for a particular person (p) on a
particular item (i) is:



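\[
\begin{aligned}
X_{pi} ={}& \mu && \text{(grand mean)} \\
 &+ \mu_p - \mu && \text{(person effect)} \\
 &+ \mu_i - \mu && \text{(item effect)} \\
 &+ X_{pi} - \mu_p - \mu_i + \mu && \text{(residual)}
\end{aligned}
\tag{1}
\]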



Since this design is crossed, all persons receive the same items.
Except for the grand mean, each score component has a distribution.
Considering all persons in the population, there is a distribution of
$\mu_p - \mu$ with mean zero and variance $E_p(\mu_p - \mu)^2 = \sigma_p^2$,
which is called the universe-score variance and is analogous to the
true-score variance of classical theory. Similarly, the component for
item has mean zero and variance $E_i(\mu_i - \mu)^2 = \sigma_i^2$, which
indicates the variance of constant errors associated with items, while
the residual component has mean zero and variance $\sigma_{pi,e}^2$,
which indicates the person x item interaction confounded with residual
error, since there is one observation per cell. The collection of
observed scores, $X_{pi}$, has a variance of
$\sigma_{X_{pi}}^2 = E_p E_i (X_{pi} - \mu)^2$, which equals the sum of
the variance components:

\[
\sigma_{X_{pi}}^2 = \sigma_p^2 + \sigma_i^2 + \sigma_{pi,e}^2 \tag{2}
\]



G-theory focuses on these variance components. They are
estimated by means of a generalizability (G) study. The relative
magnitudes of the components provide information about particular
sources of error influencing a measurement. It is convenient to
estimate variance components from an ANOVA of sample data. Numerical
estimates of the variance components are obtained by setting the
expected mean squares equal to the observed mean squares and solving
the set of simultaneous equations as shown in Table 1.

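Table 1

Estimates of Variance Components for a
One Facet p x i Design

Source of     Mean         Expected                              Estimated Variance
Variation     Square       Mean Square*                          Component

Person (p)    $MS_p$       $\sigma_{pi,e}^2 + n_i \sigma_p^2$    $\hat{\sigma}_p^2 = (MS_p - MS_{res})/n_i$
Item (i)      $MS_i$       $\sigma_{pi,e}^2 + n_p \sigma_i^2$    $\hat{\sigma}_i^2 = (MS_i - MS_{res})/n_p$
pi,e          $MS_{res}$   $\sigma_{pi,e}^2$                     $\hat{\sigma}_{pi,e}^2 = MS_{res}$

*$n_i$ = number of items; $n_p$ = number of persons.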
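
The procedure summarized in Table 1 can be sketched in plain Python.
This is a minimal illustration, assuming a complete persons-by-items
score matrix with one observation per cell; the function name
`variance_components_pxi` and the small score matrix are hypothetical
and are not part of the original study.

```python
def variance_components_pxi(scores):
    """Estimate variance components for a crossed p x i design.

    `scores` is a list of rows (persons), each a list of item scores.
    Mean squares are set equal to their expectations and solved, as in
    Table 1. Estimates are returned untruncated, so the person and item
    components can come out negative in small samples.
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[i] for row in scores) / n_p for i in range(n_i)]

    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    return {
        "p": (ms_p - ms_res) / n_i,
        "i": (ms_i - ms_res) / n_p,
        "pi,e": ms_res,
    }


# Hypothetical 4 persons x 3 items matrix with purely additive person
# and item effects, so the residual component is zero by construction.
example = [
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5],
    [4, 5, 6],
]
components = variance_components_pxi(example)
```

For this additive example the person component equals the sample
variance of the person means and the item component equals the sample
variance of the item means, which is what the Table 1 equations imply
when the residual mean square is zero.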
Estimation of error. Not only do the magnitudes of the variance
components show the importance of each source of error in the
measurement, they can be used to estimate the total error for relative
and absolute decisions. For relative decisions, the error in a p x i
design is defined as:

\[
\delta_{pI} = (X_{pI} - \mu_I) - (\mu_p - \mu), \tag{3}
\]

where I indicates that an average has been taken over the levels of
facet i under which p was observed. The variance of the errors for
relative decisions is:

\[
\sigma_\delta^2 = \sigma_{pI,e}^2 = \sigma_{pi,e}^2 / n_i', \tag{4}
\]

where $n_i'$ indicates the number of conditions of facet i to be sampled
in a D study. Notice that (a) $\sigma_{pi,e}/\sqrt{n_i'}$ is the
standard error of the mean of a person's scores averaged over the
levels of i (items in our example), and (b) the magnitude of the error
is under the control of the decision maker in the D study. In order to
reduce $\sigma_\delta^2$, $n_i'$ may be increased. This is analogous to
the Spearman-Brown prophecy formula in classical theory and the
standard error of the mean in sampling theory.

For absolute decisions, the error is defined as:

\[
\Delta_{pI} = X_{pI} - \mu_p. \tag{5}
\]

The variance of these errors in a p x i design is:

\[
\sigma_\Delta^2 = \sigma_I^2 + \sigma_{pI,e}^2
               = \sigma_i^2 / n_i' + \sigma_{pi,e}^2 / n_i'. \tag{6}
\]

In contrast to $\sigma_\delta^2$, $\sigma_\Delta^2$ includes the
variance of constant errors associated with facet i ($\sigma_i^2$).
This arises because, in absolute decisions, the difficulty of the
particular items that a person receives will influence his observed
score and, hence, the decision maker's estimate of his universe score.
For relative decisions, however, the effect of item is constant for
all persons and so does not influence the rank ordering of them (see
Erlich & Shavelson, 1976b).
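
The two error variances for a crossed p x i D study can be computed
directly from G study estimates. The sketch below is illustrative
only: the function name `error_variances` and the component values are
hypothetical, chosen to show how both errors shrink as the number of
items in the D study grows while the absolute error stays the larger
of the two.

```python
def error_variances(var_i, var_pie, n_items):
    """Return (relative, absolute) error variance for a crossed p x i
    D study in which each person's score is a mean over `n_items` items.
    """
    relative = var_pie / n_items                     # equation (4)
    absolute = var_i / n_items + var_pie / n_items   # equation (6)
    return relative, absolute


# Hypothetical G study estimates for the item and residual components.
VAR_I, VAR_PIE = 1.0, 3.0

one_item = error_variances(VAR_I, VAR_PIE, 1)
four_items = error_variances(VAR_I, VAR_PIE, 4)
twelve_items = error_variances(VAR_I, VAR_PIE, 12)
```

The gap between the two errors is the item component divided by the
number of items, so it vanishes only when items do not differ in
difficulty or when very many items are sampled.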

Finally, for decisions based on the regression estimate of a
person's universe score, error (of estimate) is defined as:

\[
\varepsilon_p = \hat{\mu}_p - \mu_p, \tag{7}
\]

where $\hat{\mu}_p$ is the regression estimate of a person's universe
score. The estimation procedure for the variance of errors of estimate
may be found in Cronbach et al. (1972, p. 97ff).

The variance components from a crossed p x i G study design can
also be used to estimate error in a nested D study design with items
nested within persons (we write i:p to denote nesting). In this
design, the effect of the constant errors associated with facet i is
confounded with the effect associated with the person by i-facet
interaction (pi,e). Hence,

\[
\sigma_{X_{pI}}^2 = \sigma_p^2 + \sigma_{I,pI,e}^2
                  = \sigma_p^2 + \sigma_\Delta^2. \tag{8}
\]

Note that, for a completely nested design,
$\sigma_\delta^2 = \sigma_\Delta^2$.

Generalizability coefficients. While stressing the importance of
variance components and errors such as $\sigma_\delta^2$,
generalizability theory also provides a coefficient analogous to the
reliability coefficient in classical theory. A generalizability (G)
coefficient can be estimated for each of a variety of D study designs
using the estimates of variance components and error produced by the G
study. A decision-maker can then use the estimated G coefficients to
choose among the D study designs. For the one-facet case described
here, generalizability coefficients can be estimated for crossed or
nested D study designs with any number of items. For designs with more
than one facet, there are many D study designs possible, each with an
estimated G coefficient.

The generalizability (G) coefficient, $E\rho^2$, for relative
decisions is defined as the ratio of the universe-score variance to
the expected observed-score variance, i.e., an intraclass correlation:

\[
E\rho^2 = \frac{\sigma_p^2}{E\sigma^2(X)}
        = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_\delta^2}. \tag{9}
\]



The expected observed-score variance is used in G-theory because the
theory assumes only random sampling of the levels of facets and so the
observed-score variance may change from one application of the design
to another. Sample estimates of the parameters in (9) are used to
estimate the G coefficient:

\[
E\hat{\rho}^2 = \frac{\hat{\sigma}_p^2}{\hat{\sigma}_p^2 + \hat{\sigma}_\delta^2};
\]

$E\hat{\rho}^2$ is a biased but consistent estimator of $E\rho^2$.
For absolute decisions a generalizability coefficient can be
defined in an analogous manner:

\[
\frac{\sigma_p^2}{\sigma_p^2 + \sigma_\Delta^2}. \tag{10a}
\]




Finally, note that, for completely nested designs, regardless of
whether relative or absolute decisions are to be made, error variance
is defined as $\sigma_\Delta^2$ and so (10) provides the
generalizability coefficient for such designs.
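
A short Python sketch may help connect these coefficients to the
Spearman-Brown analogy mentioned earlier. The function names and the
variance components below are hypothetical. For relative decisions in
the one-facet crossed design, stepping up the single-item coefficient
with the Spearman-Brown formula gives the same value as recomputing
the G coefficient with the error variance divided by the number of
items.

```python
def g_coefficients(var_p, var_i, var_pie, n_items, nested=False):
    """Return (relative, absolute) G coefficients for a one-facet D study.

    For a completely nested design (items within persons) the item
    effect is confounded with the residual, so the absolute error
    variance is used for both kinds of decision.
    """
    abs_error = (var_i + var_pie) / n_items
    rel_error = abs_error if nested else var_pie / n_items
    return var_p / (var_p + rel_error), var_p / (var_p + abs_error)


def spearman_brown(rho_single, k):
    """Classical prophecy formula for a test lengthened k times."""
    return k * rho_single / (1 + (k - 1) * rho_single)


# Hypothetical variance components from a G study.
VAR_P, VAR_I, VAR_PIE = 2.0, 1.0, 6.0

rho_1, _ = g_coefficients(VAR_P, VAR_I, VAR_PIE, n_items=1)
rho_10, phi_10 = g_coefficients(VAR_P, VAR_I, VAR_PIE, n_items=10)
rho_10_sb = spearman_brown(rho_1, 10)
nested_rel, nested_abs = g_coefficients(VAR_P, VAR_I, VAR_PIE, 10, nested=True)
```

The agreement between `rho_10` and `rho_10_sb` holds only for the
relative coefficient in the crossed design; the absolute coefficient
is lower because it also charges the item component to error.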
A two-faceted example. A study of the dependability of measures
of mathematics achievement illustrates the theory's treatment of
multifaceted measurement error. In designing a generalizability (G)
study, the decision-maker specifies possible sources of error in the
measurement of mathematics achievement. Variability across test
items is clearly a possible source of error. Furthermore, students
may obtain different scores on multiple occasions even though no
learning has taken place between occasions, so occasions is a possible
source of error. (It is assumed that true ability is constant from
one occasion to the next. Therefore, a time interval between
occasions must be selected that is short enough to prevent true
changes from taking place, whether through learning or maturation, but
is long enough to prevent students' memory of the test from
influencing their scores.) Another source of error might be item
format, such as multiple choice, true-false, or open-answer (student
fills in the correct answer): Students' scores might differ across
item formats. For the present illustration, the item and occasion
sources of error will be considered.

In the generalizability study, thirty tenth-grade students (p)
were administered a twenty-item (i) test on two occasions (j). In
differentiating students with respect to mathematics achievement,
errors in the measurement may arise from inconsistencies associated
with items, occasions, and other unidentified sources. G-theory
incorporates these potential sources of error into a measurement model
and estimates the components of variance associated with each source
of variation in the 30 x 20 x 2 (p x i x j) design.
Table 2 enumerates the sources of variation and presents the
estimated variance components for the mathematics test.

Table 2

Generalizability of Measures of Mathematics Achievement

                                         Estimated Variance Components
Source of                                n_i'=1,    n_i'=10,   n_i'=10,
Variation                                n_j'=1     n_j'=1     n_j'=2

Students (P)                              7.55       7.55       7.55
Items (I)                                 1.73        .17        .17
Occasions (J)                              .96        .96        .48
PI                                        5.42        .54        .54
PJ                                         .71        .71        .36
IJ                                         .50        .05        .02
Residual (PIJ,e)                          4.88        .49        .25

$\hat{\sigma}_\delta^2$                  11.01       1.74       1.15
G coefficient for relative decisions       .39        .81        .87

$\hat{\sigma}_\Delta^2$                  14.20       2.92       1.82
G coefficient for absolute decisions       .35        .72        .81



The first column shows that three estimated variance components are
large relative to the other components. The first, for students
($\hat{\sigma}_p^2$), is analogous to true score variance in classical
test theory and is expected to be large. The second, the student by
item interaction ($\hat{\sigma}_{pi}^2$), represents one source of
measurement error and is due to the tendency of different items to
rank students differently. The third is the residual term representing
the three-way interaction between students, items, and occasions and
unidentified sources of measurement error ($\hat{\sigma}_{pij,e}^2$).
The small components associated with occasions (the J, PJ, and IJ
components) suggest that the occasion of testing introduces little
variability into the measurement of mathematics achievement. Average
student performance over items is similar across occasions
($\hat{\sigma}_j^2$); students are ranked nearly the same across
occasions ($\hat{\sigma}_{pj}^2$); and item means are ordered nearly
the same across occasions ($\hat{\sigma}_{ij}^2$). The optimal D study
design, then, will include multiple test items but few occasions.

Table 2 also gives estimated variance components, error, and
generalizability coefficients for three D study designs: one item and
one occasion, ten items and one occasion, and ten items and two
occasions. Information is presented for both relative and absolute
decisions. As described earlier, a relative decision might be to
select the top 20 percent of the scorers for a special program. The
variance components contributing to error in this case include the
components for all interactions with persons: PI, PJ, and PIJ,e.
These are the only components that influence the rank ordering of
students. An absolute decision might be to select all students who
obtain a score of 75 percent correct or better. The error in this
case consists of all components except that for students: I, J, PI,
PJ, IJ, and PIJ,e. All of these components influence students'
absolute level of performance. As the estimates of error and
generalizability coefficients in Table 2 indicate, administering a
ten-item test on one occasion would substantially reduce error over a
single item. Increasing the number of occasions to two would reduce
error by only a small amount. The small reduction in error may not
warrant the extra time and expense involved in administering the test
twice.
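
The D study columns of Table 2 can be approximately regenerated from
the single-observation components in its first column. The sketch
below is a reader's check, not the authors' program; the function name
`d_study` is hypothetical. Each component is divided by the number of
conditions over which it is averaged, the person interactions are
summed for relative error, and all components except the person
component are summed for absolute error. Because the inputs are the
rounded tabled values, results can differ from the printed table in
the second decimal place; for example, the one-item, one-occasion
relative coefficient computes to about .41 from these inputs.

```python
# Single-observation estimates from the first column of Table 2.
G_STUDY = {
    "p": 7.55, "i": 1.73, "j": 0.96,
    "pi": 5.42, "pj": 0.71, "ij": 0.50, "pij,e": 4.88,
}


def d_study(vc, n_i, n_j):
    """Project error variances and G coefficients for a crossed
    p x I x J D study with n_i items and n_j occasions."""
    rel_error = vc["pi"] / n_i + vc["pj"] / n_j + vc["pij,e"] / (n_i * n_j)
    abs_error = (rel_error + vc["i"] / n_i + vc["j"] / n_j
                 + vc["ij"] / (n_i * n_j))
    return {
        "rel_error": rel_error,
        "abs_error": abs_error,
        "g_rel": vc["p"] / (vc["p"] + rel_error),
        "g_abs": vc["p"] / (vc["p"] + abs_error),
    }


designs = {(n_i, n_j): d_study(G_STUDY, n_i, n_j)
           for n_i, n_j in [(1, 1), (10, 1), (10, 2)]}

for (n_i, n_j), result in designs.items():
    print(n_i, n_j, {name: round(value, 2) for name, value in result.items()})
```

The printout makes the pattern in the text visible: moving from one
item to ten removes most of the error, while adding a second occasion
changes the coefficients only slightly.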

Typically, several D study designs will yield the same level of
generalizability. For a decision-maker who desires a generalizability
coefficient (relative decision) of .87, for example, there are at
least two D study designs to choose from. As indicated in Table 2,
ten items administered on two occasions would be expected to produce
this level of generalizability. Alternatively, 25 items administered
on one occasion would also produce this result. The decision-maker
must balance cost considerations to choose the appropriate D study
design. When items are difficult and expensive to produce, the former
design may be more practical. When items are fairly easy to generate
(as is probably the case in tests of mathematics achievement), the
latter design may be preferable.
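
The search for alternative designs can be automated. The following
sketch assumes the tabled G study components for the mathematics
example; the function names `g_relative` and `min_items` and the upper
bound of 200 items are illustrative choices. It looks for the smallest
number of items that reaches a target relative G coefficient for a
fixed number of occasions.

```python
def g_relative(n_i, n_j, var_p=7.55, var_pi=5.42, var_pj=0.71, var_res=4.88):
    """Relative G coefficient for a crossed p x I x J D study, using the
    single-observation components from Table 2 as defaults."""
    rel_error = var_pi / n_i + var_pj / n_j + var_res / (n_i * n_j)
    return var_p / (var_p + rel_error)


def min_items(target, n_j, max_items=200):
    """Smallest number of items reaching `target` with n_j occasions,
    or None if the target is not reached within `max_items`."""
    for n_i in range(1, max_items + 1):
        if g_relative(n_i, n_j) >= target:
            return n_i
    return None


items_one_occasion = min_items(0.87, n_j=1)
ten_items_two_occasions = g_relative(10, 2)
```

With these inputs the one-occasion search stops at 25 items, matching
the figure quoted in the text, and the ten-item, two-occasion design
gives a coefficient that rounds to .87.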
Assumptions

Lack of restrictions. Before discussing the assumptions
underlying the generalizability model and procedures, it is
instructive to describe which assumptions and restrictions occurring
in other measurement theories (for example, classical theory) are not
held in generalizability theory. First, generalizability theory
avoids the classical assumption of parallelism: equal means,
variances, and intercorrelations among conditions of a facet (for
example, item scores). The lack of these assumptions has implications
for the interpretation of the results of G and D studies. One cannot
assume that conditions sampled within a facet are equivalent. For
example, one cannot assume that items sampled for a study have the
same means, variances, and intercorrelations. Furthermore, conditions
sampled across studies cannot be assumed to be equivalent. For
example, the items selected for the G study may not have the same
level of difficulty as those selected for the D study. Moreover, the
items in one D study may not be equivalent to those selected for
another D study. The differences among conditions and between sets of
conditions may be due to characteristics of examinees as well as
characteristics of items.

To deal with the difficulty that the conditions sampled in a D
study (for example, items or occasions) may not be equivalent to
each other or to another set, Cronbach et al. (1972) discuss an
item-sampling design proposed by Lord and Novick (1968). In this
plan, a large sample of persons is subdivided at random into three or
more subsamples. In the G study, each subsample would be observed
under the set of conditions to be sampled in the D study and one
additional condition. The additional condition would be different for
each subsample. Each subsample, then, would be observed under
identical conditions plus one different condition. A comparison of
the results (variance component estimates) across subsamples would
reveal how well the set of conditions to be sampled in the D study
represents the universe of conditions. If the results across
subsamples are similar, then one can confidently generalize the
results of the D study to the conditions in the universe of
generalization. If the results are different across subsamples, one
must be very cautious in generalizing beyond the conditions (for
example, items) sampled in the D study.

Second, the generalizability model makes no assumptions about the
distributions underlying the measurements obtained in the G and D
studies, or of the universe scores. Little is known, however, about
the effects of different underlying distributions of scores on the
estimates of variance components and the efficiencies of the
estimators. It should be noted that generalizability theory does make
assumptions about the distributions underlying variance component
estimation (see next section).

Third, there is no restriction about the kinds of conditions that
can be defined as facets. Any source of variation can be defined as a
facet including, for example, test item, test form, item format,
occasion of testing, and test administrator. Generalizability theory
may be the only way to disentangle the effects of these sources of
variation. Item-response models are not able to deal with the effects
of administrator variation, for example.

Random sampling. One of the few assumptions of generalizability
theory is random sampling of persons and conditions (for random
facets). Although this assumption is considerably weaker than the
assumption of classical theory that conditions are strictly parallel
(equal means, variances, correlations), it has often raised objections
from those who maintain that measurements rarely consist of random
samples from well-defined universes of generalization (for example,
Loevinger, 1965; Rozeboom, 1966; Gillmore, 1979). As Kane (1982,
p. 30) points out, "The effects of unintended departures from the
random sampling assumption cannot be evaluated accurately, and
therefore the interpretation of G-study results must always be
somewhat tentative."

Brennan (1981) sets a more optimistic tone by suggesting that the
universe of generalization need not be undifferentiated (as, for
example, a universe of test items), but may be structured such that
the assumption of random sampling is more acceptable (for example,
sampling from categories representing different item or content
specifications).

Lord and Novick (1968, p. 235) also provide support for the
random sampling assumption, which is relevant for generalizability
theory:

A possible objection to the item-sampling model (for
example, see Loevinger, 1965) is that one does not
ordinarily build tests by drawing items at random from a
pool. There is, however, a similar and equally strong
objection to classical test theory: Classical theory
requires test forms that are strictly parallel, and yet no
one has ever produced two strictly parallel forms for any
ordinary paper-and-pencil test. Classical test theory is to
be considered a useful idealization of situations
encountered with actual mental tests. The assumption of
random sampling of items may be considered in the same way.
Further, even if the items of a particular test have not
actually been drawn at random, we can still make certain
interesting projections: We can conceive an item population
from which the items of the test might have been randomly
drawn and then consider the score the examinee would be
expected to achieve over this population. The abundant
information available on such expected scores enhances their
natural interest to the examinee.
Infinite universe. Related to the random sampling assumption
described above is the assumption for random facets that the number of
conditions in the universe of admissible conditions be indefinitely
large. When the universe (of admissible observations or of
generalization) is finite, the analysis and interpretation need to be
adjusted, depending upon the relationships among the number of
conditions sampled in the G study, the number of conditions in the
universe of admissible observations, and the number of conditions in
the universe of generalization. The universe of admissible observations
comprises all possible combinations of conditions represented in the G
study. The universe of generalization consists of those combinations
of conditions over which the decision-maker wishes to generalize.
Although the two universes may be the same, the universe of
generalization often will be smaller (fewer facets) than the universe
of admissible observations. For example, a G study with items, test
administrators, and occasions as facets may show little variability
due to test administrators and occasions but substantial variability
due to items. For the D study, then, the decision-maker may decide to
use one test administrator and administer the test on only one
occasion but use multiple items. The universe of admissible
observations would have three facets; the universe of generalization
would have one facet. Cronbach et al. (1972) consider several
possibilities of finite universes and describe the implications for
analysis. As Cronbach et al. point out, the intermediate cases in
which a subset of a finite universe of conditions is sampled can be
complex.
In most applications, the decision-maker's choice is between
random sampling from an indefinitely large universe (random facet) or
inclusion of all of a finite set of conditions (fixed facet). In the
latter case, Shavelson and Webb (1981) recommend that the
decision-maker examine the variability of the conditions of the fixed
facet. If the variability is small, the scores can be averaged over
conditions of the fixed facet. When the variability is large, however,
each condition should be treated separately or the scores should
be treated as a profile. Whenever there is a question about
the magnitude of the variability, it may be most reasonable to present
the results for each condition separately as well as the average over
the conditions of the facet. This recommendation applies to the D
study as well as to the G study.

Variance components. Generalizability theory assumes that the
distributions underlying variance components are normal and that
variance components cannot be negative. Analyses of non-normal
distributions of variance components by Scheffe (1959; see Cronbach et
al., 1972, p. 52) suggest that departures from normality can have a
large effect on the "trustworthiness" of the confidence interval
around the variance component.

Negative estimates of variance components can arise as a result
of sampling variability or model misspecification. For example, a
random-effect model may not be valid (Nelder, 1954). Cronbach et al.
(1972) suggest that zero be substituted for negative estimates and
carried into any expected mean square equation containing that
component. As Scheffe (1959) and others have pointed out, the zero
estimates and modified estimates for other effects are biased. The
greater the number of facets in the design (particularly for crossed
designs), the greater the potential for a large number of biased
estimates of variance components.
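
One reading of this substitution rule can be sketched for the
person-related components of a crossed p x i x j random design. The
function name `person_components` and the mean squares below are
hypothetical, and the order of solution (from the residual upward) is
an assumption about how the rule would be applied, not a procedure
quoted from Cronbach et al. Each negative estimate is replaced by zero
before it enters the equation for the next component.

```python
def person_components(ms_p, ms_pi, ms_pj, ms_res, n_i, n_j):
    """Solve the expected-mean-square equations for the person-related
    components, working up from the residual and replacing any negative
    estimate by zero before it is used in a later equation.

    Expected mean squares assumed (all effects random):
      EMS(pij,e) = var_res
      EMS(pi)    = var_res + n_j * var_pi
      EMS(pj)    = var_res + n_i * var_pj
      EMS(p)     = var_res + n_j * var_pi + n_i * var_pj + n_i * n_j * var_p
    """
    var_res = ms_res
    var_pi = max(0.0, (ms_pi - var_res) / n_j)
    var_pj = max(0.0, (ms_pj - var_res) / n_i)
    var_p = max(0.0, (ms_p - var_res - n_j * var_pi - n_i * var_pj)
                / (n_i * n_j))
    return {"p": var_p, "pi": var_pi, "pj": var_pj, "pij,e": var_res}


# Hypothetical mean squares in which MS(pi) falls below MS(res).
adjusted = person_components(ms_p=50.0, ms_pi=2.0, ms_pj=6.0, ms_res=3.0,
                             n_i=4, n_j=2)

# Unadjusted solution for the person component, for comparison.
unadjusted_p = (50.0 - 2.0 - 6.0 + 3.0) / (4 * 2)
```

In this example the person-by-item estimate would be negative, so it
is set to zero, and the person component changes from the unadjusted
value as a consequence. This illustrates why the modified estimates
for other effects are no longer unbiased.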

The problem of negative estimates of variance components is not
insurmountable, however. Cronbach et al. (1972) suggest the use of a
Bayesian approach, which not only provides a solution to the problem
of negative estimates, but also provides estimates of variance
components that are interpretable with respect to the sample data, not
to repeated sampling. Fyans' (1977; see also Box & Tiao, 1973; Davis,
1974; Hill, 1965, 1967, 1970; Novick et al., 1971) strategy for
obtaining Bayesian estimates constrains the estimates to be greater
than or equal to zero. The resulting estimates are biased, however.
Limitations of the Procedures 

The two major limitations of the procedures of generalizability
theory to be discussed here are the need for extensive data for
reliable estimates of variance components, and the difficulties of
estimation in unbalanced designs. It should be noted that these
limitations are not weaknesses in the theory but are difficulties
arising in practice.

Sampling variability of estimated variance components. Since
G-theory emphasizes the estimation and interpretation of variance
components, their sampling variability is of great importance, albeit
seldom addressed. Two issues arise: a comparison of sampling
variability of variance components for different effects in a design,
and the magnitude of sampling errors in studies with moderate numbers
of observations.

Concerning the first issue, a comparison of sampling variances 
for different effects in a G-theory design suggests that the sampling 
estimates of the universe score variance may be less stable than 
estimates of components of error variance. This result derives from
an inspection of general formulas for sampling variances of estimated 
variance components (see Smith, 1978). In fully crossed designs, at 
least, the formulas for sampling variability of estimated variance 
components for main effects contain more components, and (for moderate 
numbers of persons and conditions) can be expected to yield a larger 
sampling variance estimate, than the formulas for higher-order 
interaction effects. An illustration of this result for a two-facet, 
crossed (p x i x j), random model design comes from Smith (1978, 
Figure 1). The variance of the estimated variance component for 
persons (the universe score variance) is 



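\[
\operatorname{Var}(\hat{\sigma}_p^2) = \frac{2}{n_p - 1} \left[
  \left( \sigma_p^2 + \frac{\sigma_{pi}^2}{n_i} + \frac{\sigma_{pj}^2}{n_j}
         + \frac{\sigma_{res}^2}{n_i n_j} \right)^2
  + \frac{1}{n_i - 1} \left( \frac{\sigma_{pi}^2}{n_i}
         + \frac{\sigma_{res}^2}{n_i n_j} \right)^2
  + \frac{1}{n_j - 1} \left( \frac{\sigma_{pj}^2}{n_j}
         + \frac{\sigma_{res}^2}{n_i n_j} \right)^2
  + \frac{1}{(n_i - 1)(n_j - 1)} \left( \frac{\sigma_{res}^2}{n_i n_j} \right)^2
\right],
\]

while the variance of the estimated component for the residual is

\[
\operatorname{Var}(\hat{\sigma}_{res}^2)
  = \frac{2}{(n_p - 1)(n_i - 1)(n_j - 1)} \, \sigma_{res}^4 .
\]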

In general, the sampling errors are expected to be greater for designs
with greater numbers of facets than for designs with few facets, thus
producing a trade-off between bandwidth and fidelity.
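
To give a sense of scale, the two formulas can be evaluated for the
mathematics example (30 students, 20 items, 2 occasions). This is a
rough sketch under stated assumptions: it uses the normal-theory
expressions for a fully crossed random design as given above, and it
plugs the rounded estimates from Table 2 in place of the unknown
population components, which is itself an approximation. The function
names are hypothetical.

```python
def var_of_person_component(vc, n_p, n_i, n_j):
    """Sampling variance of the estimated person component in a crossed
    p x i x j random design (normal-theory formula)."""
    res_term = vc["res"] / (n_i * n_j)
    a = vc["p"] + vc["pi"] / n_i + vc["pj"] / n_j + res_term
    b = vc["pi"] / n_i + res_term
    c = vc["pj"] / n_j + res_term
    d = res_term
    bracket = (a ** 2
               + b ** 2 / (n_i - 1)
               + c ** 2 / (n_j - 1)
               + d ** 2 / ((n_i - 1) * (n_j - 1)))
    return 2.0 / (n_p - 1) * bracket


def var_of_residual_component(vc, n_p, n_i, n_j):
    """Sampling variance of the estimated residual component."""
    return 2.0 * vc["res"] ** 2 / ((n_p - 1) * (n_i - 1) * (n_j - 1))


# Rounded estimates from the first column of Table 2, treated here as
# if they were the population values.
TABLE_2 = {"p": 7.55, "pi": 5.42, "pj": 0.71, "res": 4.88}

var_p_hat = var_of_person_component(TABLE_2, n_p=30, n_i=20, n_j=2)
var_res_hat = var_of_residual_component(TABLE_2, n_p=30, n_i=20, n_j=2)
se_p_hat = var_p_hat ** 0.5
se_res_hat = var_res_hat ** 0.5
```

Under these assumptions the standard error of the person component is
a sizable fraction of the component itself, while the residual
component is estimated far more precisely, which is the contrast the
text describes.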

The second issue concerns the magnitude of sampling errors of 
estimated variance components. Monte carlo simulations conducted by 
Scnith (1978, 1,980), Calkins et al • (1978), and Leone and Nelson (1966) 
for a variety of crossed and nested designs produced large, sampling, 
errors for small and moderate numbers of persons and conditions. 
Smith, for example, found that "(a) the sampling errors of variance 
components are much greater for multi faceted universes than for single 
faceted universes ; (b) for the sampling errors were large 

unless the total number of observati^ons (npn^nj ) was at least 800; (c) 
stable estimates of a? and required at least eight levels of 

each facet; and (d) some nested designs produced mere stable estimates 
than did crossed designs" (Shavelson & Webb, 1981, p. 141). Smith's 
results pose a serious problem for the interpretation of results in 
the moderately sized designs typically used. The requirements of 
large numbers of conditions and large numbers of total observations 
for stable estimates of variance components are rarely met in most G 
and D studies. 
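The kind of Monte Carlo check referred to above can be sketched for the simplest case. The example below is an illustration only, not a reconstruction of the cited simulations: it assumes a one-facet crossed p x i design with normal effects and unit variance components, and the function names are hypothetical.

```python
import numpy as np


def estimate_sigma2_p(scores):
    """ANOVA estimate of the person variance component in a crossed p x i design."""
    n_p, n_i = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)
    ms_p = n_i * np.sum((person_means - grand) ** 2) / (n_p - 1)
    resid = scores - person_means[:, None] - item_means[None, :] + grand
    ms_res = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1))
    return (ms_p - ms_res) / n_i


def simulate_sd(n_p, n_i, reps=300, seed=0):
    """Standard deviation of the estimate over repeated simulated G studies."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(reps):
        person = rng.normal(0.0, 1.0, size=(n_p, 1))   # true person variance = 1
        item = rng.normal(0.0, 1.0, size=(1, n_i))     # true item variance = 1
        error = rng.normal(0.0, 1.0, size=(n_p, n_i))  # true residual variance = 1
        estimates.append(estimate_sigma2_p(person + item + error))
    return float(np.std(estimates))


if __name__ == "__main__":
    for n_p in (10, 50, 200):
        print(n_p, round(simulate_sd(n_p, 4), 3))
```

Under these assumptions the spread of the estimated person component shrinks as the number of persons grows, and it remains sizeable relative to the true value of 1 when only a handful of persons is sampled.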

Woodward and Joe (1973) and Smith (1978) recommended that
measurements be allocated in the D study in specific ways to minimize
sampling variability. For example, in a p x i x j design, they
recommended using equal numbers of conditions of facets i and j when
$\sigma^2_{res}$ increases relative to $\sigma^2_{pi}$ and $\sigma^2_{pj}$, and making the
numbers of conditions of facets i and j proportional to $\sigma^2_{pi}/\sigma^2_{pj}$
when $\sigma^2_{res}$ decreases relative to $\sigma^2_{pi}$ and $\sigma^2_{pj}$. These decisions
are based on the results of the G study.

To deal with the requirement of large numbers of observations,
Smith (1980) also proposed the use of several small G studies with
many conditions of a few facets, each estimating part of a complex G
study, instead of one large G study with a few conditions of many
facets. As Shavelson and Webb (1981) point out, however, there is a
question of how well the restricted universes of the several small G
studies represent the universe of the single, large G study.

Unbalanced designs. A major difficulty with the ANOVA approach
to estimating variance components arises in unbalanced designs, in
which there are unequal numbers of observations in the
subclassifications. An example which occurs in many tests is an unequal number of
items across subtests. Another example is students nested within 
classes where class size varies. The primary difficulty with 
unbalanced data is computational complexity. The usual rules for 
deriving expected values of mean squares (Cornfield & Tukey, 1956) do 
not apply to unbalanced designs. Although computer programs have been 
developed to estimate variance components in unbalanced designs, they 
require large storage capacities and, therefore, may be prohibitively 
expensive in many cases. (For descriptions of the computer programs, 
see Brennan et al., 1980; Llabre, 1978, 1980; Rao, 1971, 1972.) 

Strengths and Weaknesses of the Model

The major strength of generalizability theory is its ability to
assess sources of error in the measurement and, consequently, to
design optimal decision-making studies. This ability affects not only
a specific decision-maker's study but, as Cronbach et al. (1972, p.
384) point out, it can help evaluate existing testing practices:

Application of generalizability theory should operate
ultimately to increase the accuracy of test
interpretations. It will make interpretation more cautious
as the inadequate generalizability of a procedure becomes
recognized, and it will encourage the development of
procedures more suitable for generalized interpretation.

The weak assumptions afford the decision-maker great flexibility
in designing generalizability and decision studies, and in defining
relevant universes of interest. At the same time, however, the lack 
of assumptions leaves several questions unanswered. One is the lack 
of guidelines about the reasonableness of data. For example, the 
effects of outliers or influential observations on the estimates are 
not well known. 

Present Areas of Application

Reliability. As was described in the first section of this
paper, a primary goal of G-theory is to design measurement procedures
that minimize error variability, and thereby maximize reliability,
while at the same time allowing the decision-maker to generalize over
a broad range of testing situations. Generalizability theory has been
applied to a variety of areas in the behavioral sciences to study the
dependability of measures of the behavior of schizophrenic patients
(e.g., Mariotto & Farrell, 1979), assertion in the elderly (Edinberg
et al., 1977), free recall in children (Peng & Parr, 1976), depth and
duration of sleep (Coates et al., 1979), behavior of teachers (Erlich
& Shavelson, 1978), dentists' sensitivity toward patients (Gershen,
1976), educational attainment (Cardinet et al., 1976), job
satisfaction using Spanish and English forms (Katerberg et al., 1977),
student ratings of instruction (Gillmore et al., 1978), and
heterosexual social anxiety (Farrell et al., 1979).

Linked conditions and multivariate estimation. Educational and
psychological measurements often provide multiple scores which may be
interpreted as profiles (for example, patterns of scores on the
Wechsler Intelligence Scale for Children are used to place students in
special education programs) or composites (for example, the
Comprehensive Test of Basic Skills). Although the most common
procedures used to assess reliability focus on the separate scores or
on the composite, neither method assesses the linkage or error
covariation among the multiple scores. For example, subtest scores
from the same test battery are "linked" by virtue of occurring on the
same test form and on the same occasion. Information about the
covariation among scores is important for designing an optimal D
study and for permitting the decision-maker to determine the composite
with maximum generalizability. For these purposes, a multivariate
analysis is more appropriate (see Cronbach et al., 1972; Shavelson &
Webb, 1981; Travers, 1969; Webb & Shavelson, 1981).

In extending G-theory's notion of multifaceted error variance to
multivariate designs, subtest scores, for example, would be treated
not as a facet of measurement but as a vector of outcome scores.
While univariate G-theory focuses on variance components, multivariate
G-theory focuses on matrices of variance and covariance components.
The matrix of variances and covariances among observed scores is
decomposed into matrices of components of variance and covariance.
The expected mean square and cross-product equations from a
multivariate analysis of variance are solved in analogous fashion to
their univariate counterparts. For example, the decomposition of the
variance-covariance matrix of observed scores in a one-facet, crossed
design with two dependent variables (for example, the grammar and
paragraph comprehension subtests in a language arts battery) is:






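$$
\underbrace{\begin{bmatrix}
\sigma^2({}_1X_{pi}) & \sigma({}_1X_{pi}, {}_2X_{pg}) \\
\sigma({}_1X_{pi}, {}_2X_{pg}) & \sigma^2({}_2X_{pg})
\end{bmatrix}}_{\text{(observed scores)}}
=
\underbrace{\begin{bmatrix}
\sigma^2({}_1p) & \sigma({}_1p, {}_2p) \\
\sigma({}_1p, {}_2p) & \sigma^2({}_2p)
\end{bmatrix}}_{\text{(persons)}}
+
\underbrace{\begin{bmatrix}
\sigma^2({}_1i) & \sigma({}_1i, {}_2g) \\
\sigma({}_1i, {}_2g) & \sigma^2({}_2g)
\end{bmatrix}}_{\text{(conditions)}}
+
\underbrace{\begin{bmatrix}
\sigma^2({}_1pi,e) & \sigma({}_1pi,e,\; {}_2pg,e) \\
\sigma({}_1pi,e,\; {}_2pg,e) & \sigma^2({}_2pg,e)
\end{bmatrix}}_{\text{(residual)}}
$$

where ${}_1X_{pi}$ = score on variable 1 for person p observed under
condition i,

${}_2X_{pg}$ = score on variable 2 for person p observed under
condition g, and

${}_1p$ = abbreviation for ${}_1\mu_p$: the universe score on variable
1 for person p.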



In the above equation, the term $\sigma({}_1p, {}_2p)$ is the covariance between
universe scores on variables 1 and 2 (grammar and paragraph
comprehension). The term $\sigma({}_1i, {}_2g)$ is the covariance between scores
on the two variables due to the condition of observation. Facet i may
be the same as facet g, for example, when the grammar and paragraph
comprehension scores are obtained from the same test form (on the same
occasion). The term $\sigma({}_1pi,e,\; {}_2pg,e)$ is the covariance due to
unsystematic error.

The matrices of variance and covariance components provide 
essential information for deciding whether multiple scores in a 
battery should be treated as a profile or a composite as opposed to 
separate scores. The matrix of covariance components for universe 
scores particularly shows whether it is reasonable to consider the
scores as representing an underlying dimension, in which case a
profile or a composite is reasonable. Small covariance components
relative to the variance components suggest that the scores are not 
related and that a composite of the scores would not be interpretable. 
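The way such component matrices are obtained from data can be sketched as follows. This is a minimal illustration, assuming a balanced one-facet crossed design in which every person is observed on every variable under every condition (the linked case, with facet i the same as facet g); the function name and array layout are assumptions made for the example.

```python
import numpy as np


def covariance_components(x):
    """Estimate variance-covariance component matrices for a balanced
    persons x conditions design with several dependent variables.

    x has shape (n_persons, n_conditions, n_variables).
    """
    n_p, n_o, n_v = x.shape
    grand = x.mean(axis=(0, 1))
    p_dev = x.mean(axis=1) - grand          # person means, centered
    o_dev = x.mean(axis=0) - grand          # condition means, centered
    resid = (x - x.mean(axis=1, keepdims=True)
               - x.mean(axis=0, keepdims=True) + grand)
    # Mean product matrices (multivariate analogues of mean squares).
    mp_p = n_o * p_dev.T @ p_dev / (n_p - 1)
    mp_o = n_p * o_dev.T @ o_dev / (n_o - 1)
    r = resid.reshape(-1, n_v)
    mp_res = r.T @ r / ((n_p - 1) * (n_o - 1))
    # Solve the expected mean product equations, as in the univariate case.
    return {
        "persons": (mp_p - mp_res) / n_o,
        "conditions": (mp_o - mp_res) / n_p,
        "residual": mp_res,
    }


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    scores = rng.normal(size=(30, 2, 3))    # hypothetical data
    for name, mat in covariance_components(scores).items():
        print(name)
        print(np.round(mat, 2))
```

The diagonal of each returned matrix holds the univariate variance components for the separate variables; the off-diagonal entries are the covariance components discussed above.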

Although the components of variance and covariance are of primary
importance and interest, a decision-maker may find it useful to obtain
the dimensions of scores (composites) with maximum generalizability.
The multivariate extension of the univariate generalizability
coefficient was developed by Joe and Woodward (1976). From a random
effects multivariate analysis of variance, the canonical variates are
determined to maximize the ratio of universe-score variation to
universe-score plus error variation. For the two-facet fully crossed
design, Joe and Woodward's multivariate coefficient for relative
decisions is

$$
\rho^2 = \frac{a' V_p\, a}
{a' V_p\, a + \dfrac{a' V_{pi}\, a}{n_i'} + \dfrac{a' V_{pj}\, a}{n_j'} + \dfrac{a' V_{pij,e}\, a}{n_i' n_j'}}
$$

where V = a matrix of variance and covariance
components estimated from mean square
matrices,

$n_i'$ and $n_j'$ = the numbers of conditions of facets i and j
in a D study, and

a = the vector of canonical coefficients that
maximizes the ratio of between-person to
between-person plus within-person variance
component matrices.
There is a set of canonical coefficients (a) for each characteristic
root in the above equation. Each set of canonical coefficients
defines a composite of scores. By definition, the first composite is
the most reliable while the last composite is the least reliable.
This procedure, then, produces the most generalizable composite of
subtest scores, for example, that takes into account the linkage among
the scores.

An application of multivariate generalizability theory to
arithmetic achievement (reported in Webb, Shavelson, & Maddahian,
1982) will be used as an illustration. Three subtests representing
basic computational skills (addition/subtraction, multiplication, and
division) were selected from the mathematics battery at grade five
from the Beginning Teacher Evaluation Study (BTES), a research program
designed to identify effective teaching behavior in elementary school
reading and mathematics. A sample of 127 students completed the three
mathematics subtests on two occasions. The design of the multivariate
study, then, had one facet (occasions) crossed with persons.

Table 3 presents the matrices of components of variance and
covariance for the three effects in the design: persons, occasions,
and the residual. The substantial components of covariance for persons
(the universe-score component matrix) show that the three
subtests are substantially related and that it is reasonable to form a
composite of the scores. The non-zero components of covariance for
the residual show that the tendency for students to be rank ordered
differently across occasions (interaction between persons and
occasions) is consistent across subtests.

The dimensions of mathematical skill that have maximum
generalizability are presented in Table 4. When the generalizability
of mathematics scores was estimated for a single occasion, one
dimension with a generalizability coefficient exceeding .60 emerged from
the analysis. This dimension is a general composite heavily weighted
by division. The analysis with two occasions produced two dimensions
with generalizability coefficients exceeding .60. The first is the
general composite described above; the second is a contrast between
addition/subtraction and division.
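The search for maximally generalizable composites can be sketched as a generalized eigenvalue problem. The example below is an illustration under stated assumptions, not the authors' computation: it uses the person and residual component matrices reported in Table 3, treats the design as one-facet (occasions) for relative decisions, and solves the problem with a Cholesky transformation. The scaling of the weight vectors is arbitrary, so the weights need not match the canonical coefficients in Table 4.

```python
import numpy as np

# Component matrices for persons and for the residual (PO,e) from Table 3.
V_P = np.array([[2.27, 2.08, 1.07],
                [2.08, 5.64, 2.41],
                [1.07, 2.41, 3.60]])
V_PO_E = np.array([[2.34, 0.84, 0.00],
                   [0.84, 5.84, 0.28],
                   [0.00, 0.28, 1.74]])


def max_generalizability(v_p, v_err, n_occasions):
    """Return generalizability coefficients (descending) and weight vectors
    maximizing a'V_p a / (a'V_p a + a'V_err a / n_occasions)."""
    total = v_p + v_err / n_occasions
    chol = np.linalg.cholesky(total)            # total = chol @ chol.T
    inv_chol = np.linalg.inv(chol)
    sym = inv_chol @ v_p @ inv_chol.T           # symmetric standard problem
    rho2, vecs = np.linalg.eigh(sym)            # eigenvalues in ascending order
    order = np.argsort(rho2)[::-1]
    weights = inv_chol.T @ vecs[:, order]       # map back to composite weights
    return rho2[order], weights


if __name__ == "__main__":
    for n_o in (1, 2):
        rho2, weights = max_generalizability(V_P, V_PO_E, n_o)
        print("occasions:", n_o)
        print("coefficients:", np.round(rho2, 2))
        print("weights (columns):")
        print(np.round(weights, 2))
```

Each column of the returned weight matrix defines one composite, ordered from most to least generalizable; averaging over more occasions can only raise each coefficient, since it shrinks the error term for every composite.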
Table 3

Estimated Variance and Covariance Components for
Multivariate Generalizability Study of Basic Skills (n_o = 1)

Source of              Addition/Subtraction   Multiplication   Division
Variation                      (1)                 (2)            (3)

Persons (P)      (1)          2.27
                 (2)          2.08                5.64
                 (3)          1.07                2.41           3.60

Occasions (O)    (1)           .00
                 (2)          -.12                1.27
                 (3)          -.04                 .49            .17

PO,e             (1)          2.34
                 (2)           .84                5.84
                 (3)           .00                 .28           1.74

Table 4

Canonical Variates for Multivariate
Generalizability Study of Basic Skills

                                   n_o = 1                n_o = 2
                               I      II     III      I      II     III

(1) Addition/Subtraction     [entries illegible]
(2) Multiplication            .07    -.11    .31     .07    -.13    .38
(3) Division                  .35     .28   -.12     .37     .33   -.15

Coefficient of
Generalizability (rho^2)      .71     .44    .33     .83     .61    .50



New Areas of Application 

This section includes areas that have been developed but rarely 
applied in practice, including test design and estimation of universe 
scores and profiles, as well as areas that need to be developed, 
including estimation of phenomena that change over time and the
effects of underlying score distributions on estimation and sampling 
variability of estimators. 

Test design. Generalizability theory can be used in designing
tests: for example, providing information on variability among
subtests, items within subtests, and item formats. Any of these
characteristics of tests can be used to define the universes of
admissible observations and generalization and can be included as
facets in G and D studies. Complexly structured tests can even be
considered, as in the case of unequal numbers of items for different
subtests in a test battery. A straightforward way to deal with this
case is to consider subtest as fixed, and to perform separate G
analyses (with items as a facet) for each subtest. Conditions of the
testing situation, as opposed to the test itself, can also be taken
into account, such as occasion, examiner, and scorer.

Estimation of universe scores and profiles. A contribution of
generalizability theory is the provision of point estimates of
universe scores and of score profiles. Cronbach et al. (1972, p. 103)
present an estimation equation (based on Kelley, 1947) for a point
estimate of the universe score which is shown to be more reliable than
observed scores:



Although this procedure could be repeated for each subtest in a test
battery, thus producing a universe score profile, it would not take
full advantage of the relationships among the subtests.
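For reference, a Kelley-type regression estimate of the universe score is commonly written in the following form. The notation is an assumption made for illustration and may differ from that used by Cronbach et al. (1972, p. 103): $X_p$ is the person's observed score, $\bar{X}$ is the group mean, and $\hat{\rho}^2$ is the estimated generalizability coefficient.

$$
\hat{\mu}_p = \hat{\rho}^2 X_p + (1 - \hat{\rho}^2)\,\bar{X}
$$

The estimate pulls each observed score toward the group mean, and the pull is stronger the less generalizable the observed score is.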

Cronbach et al. (1972, pp. 313-314) show how the correlations
among variables in a test battery can be taken into account to produce
a more dependable profile of universe scores. Basically, the
regression equation for a particular score in the profile includes not
only the observed scores on that variable (as in the above equation)
but also the observed scores for all other scores in the set. The set
of multiple regression equations produces a profile of
estimated universe scores for each person. This profile is more
reliable (and usually flatter) than that based on univariate
regression equations. In an example using data from the Differential
Aptitude Tests (DAT), Cronbach et al. (1972) reported reductions in
error variance as large as 42 percent when all subtests were used as
predictors compared to error variances from single predictors. Such
universe score profiles are useful for guidance decisions and
diagnostic purposes. It is important to note further that the
regression methods outlined here may produce not only flatter profiles
than observed scores, but sometimes will invert relationships in an
observed-score profile. The important implication for counseling and
research is that observed profiles and those estimated from univariate
regressions may be much further from the true profiles than
multivariate estimates.

Changing phenomena vs. steady state phenomena. All of the
discussion thus far has assumed that the phenomenon being studied
remains constant over observations. The problem is very complex,
however, when the universe score changes over time, as is the case in
maturation studies (e.g., Bayley, 1968). This problem is particularly
acute in testing situations which assume no change in true ability or
knowledge across testing situations but in which sufficient time
elapses that true changes do appear. A further complication is that
the growth patterns of different individuals over time may not be
equivalent. A few inroads into this area are the work of Bryk (1980)
and Maddahian (1982).

Underlying score distributions. Little is known about the
impact of varying underlying score distributions on the estimation and
sampling variability of univariate parameters (universe
score estimates, variance components, and generalizability
coefficients) and multivariate parameters (universe score
profile estimation, components of covariance, multivariate
generalizability coefficients, and canonical coefficients). This is
clearly an area in need of development. Issues needing to be
addressed include bias and efficiency of the estimators.
ANALYSIS OF READING COMPREHENSION DATA*



The data set used in this analysis is taken from the 1971 survey
of reading achievement in the United States carried out in conjunction
with the International Association for the Evaluation of Educational
Achievement's Study of Reading Comprehension in 15 Countries
(Thorndike, 1973). The total sample consisted of 5,479 students from a
probability sample of 239 schools scattered across the United States
(Wolf, 1977). Each of the selected students was asked to complete a
variety of tests and questionnaires designed to establish the relative
influence of various external factors on the development of reading
achievement and an interest in reading.

The international research program called for the administration 
of essentially the same tests (though translated into different 
languages) to comparable samples of students in each country. The
"between country" variation in background factors, school
organization, parental expectation and involvement, cultural importance of
written communication, etc., offered a unique opportunity to use the
natural laboratory to investigate their respective influences. It was
necessary in such a research study, however, to develop the
measurement instruments with great care. They not only had to be of high
psychometric quality, but also had to be capable of translation into a 
range of languages so as to yield comparable, relevant, and fair 
measures of achievement in all the participating countries. For this
reason, the tests do not appear "familiar" in content or style when
compared with those regularly in use in any one country, but they were
judged to be accessible enough to the average student in each country
to yield an appropriately valid measure of achievement.

* This chapter was compiled by David McArthur from contributions by
Bruce Choppin, David McArthur, Raymond Moy, and Noreen Webb.

Two separate reading comprehension tests were administered. Each
consisted of short reading passages of between 100 and 200 words,
followed by a group of multiple-choice questions the answers to which
could be found in the passage. The first section consisted of four
reading passages and a total of 21 items. The second section had five 
reading passages and 24 items. Treated together for this analysis, 
they yield a multiple-choice test of reading comprehension containing 
45 items (these are listed in Appendix I). 

In order to perform a fair comparison of the different 
mathematical models for measuring achievement, it was decided to limit 
the analysis to samples of 1,000 students drawn from the master set. 
As a back-up and to estimate the stability of the parameters obtained,
some analyses were repeated on a second, non-overlapping sample of
1,000 students. Four approaches were applied to the 45 items of the
Reading Comprehension Test for these samples of 1,000 cases: S-P
analysis, Rasch analysis, generalizability analysis, and three-parameter
latent trait analysis. Each is taken in turn below.

S-P Analysis

The S-P technique produced item p-values, person total scores,
caution indices for both items and persons, the pair of curves (S and
P), the overall index of ordering and agreement with a perfect Guttman
scale, and rank positions for both items and persons.

Average difficulty is p = 0.532 with a range of 0.854 to 0.167 (of
which the three most difficult items are answered correctly no better
than chance). D*, the indicator of hypothetical misfit, is 0.506, a
fairly high value. The average caution index for items (C_j*) is
0.250, ranging from 0.101 to 0.395. Eight of the items have caution
indices exceeding 0.333.

In decreasing order of severity, these are items 16, 39, 31, 20,
43, 44, 7, and 42. The range of caution indices (C_i*) for respondents
is from 0.038 to 0.730, with only three persons achieving below 0.050
but twenty-seven achieving above 0.500. There is a strong negative
correlation (r = -0.45) between the item difficulties and their caution
indices. According to this solution, the test appears to contain a
moderate number of items poorly suited to this sample. Many correct
responses are likely to be the result of chance guessing, and fully
one-fifth of the items are exceptionally poor at discriminating
between ability levels.
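How a caution index flags an aberrant response pattern can be sketched on a toy matrix. The example below assumes that C* denotes the modified caution index in one commonly cited form, bounded by 0 (a perfect Guttman pattern) and 1 (a fully reversed pattern); the function name and the toy data are illustrative, and this is not the program used for the analysis reported here.

```python
import numpy as np


def modified_caution_indices(responses):
    """Modified caution index C* for each person (row) of a 0/1 matrix.

    Items are ordered from easiest to hardest by their totals. For a person
    with k correct answers, the numerator weights the easy items missed
    against the hard items passed; the denominator is the largest value the
    numerator can take for that k.
    """
    u = np.asarray(responses, dtype=float)
    item_totals = u.sum(axis=0)
    order = np.argsort(-item_totals, kind="stable")   # easiest item first
    u, n_j = u[:, order], item_totals[order]
    n_items = u.shape[1]
    out = np.full(u.shape[0], np.nan)
    for row, resp in enumerate(u):
        k = int(resp.sum())
        if k == 0 or k == n_items:
            continue                                  # index undefined here
        missed_easy = np.sum((1 - resp[:k]) * n_j[:k])
        passed_hard = np.sum(resp[k:] * n_j[k:])
        denom = n_j[:k].sum() - n_j[-k:].sum()
        if denom > 0:
            out[row] = (missed_easy - passed_hard) / denom
    return out


if __name__ == "__main__":
    toy = [[1, 1, 1, 0],
           [1, 1, 0, 0],
           [1, 0, 0, 0],
           [1, 1, 1, 0],
           [0, 0, 0, 1]]   # last person passes only the hardest item
    print(modified_caution_indices(toy))
```

In the toy matrix the first four respondents follow the Guttman ordering and receive an index of 0, while the last respondent, who answers only the hardest item correctly, receives an index of 1.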

When those items with the highest caution indices are dropped
altogether from the S-P analysis, the entire matrix and all associated
indices for the items that remain and for all of the respondents are
recalculated. While the truncated test on average is less difficult,
there is little comparable decrease in the overall index of misfit.
The number of respondents with elevated caution indices is exactly
twice that of the first analysis, with the interesting finding that a
proportion of that increase is to be found in the top-scoring 10% of
the sample. It seems that when some items are removed because
evidence shows that responses to them are generally not in
correspondence with student ability, the S-P approach then penalizes
some of the upper-ability students. This occurs when a student
manages to get most of the included items correct, and most of the
excluded items wrong, but also has one or two additional wrong
answers. In the analysis of the full set of items, those last one or
two wrong answers do not cause the caution index to be far out of
line, but in the truncated set, those wrong answers can contribute
heavily. For those students at the opposite end of the ability scale,
both the first and second analyses show a sizeable number of high
caution indices and very few low caution indices.

The low ability students are not measured well by this test, 
according to the S-P analysis, and generally there is an unanticipated 
large number of wrong answers by those whose overall ability level 
would have led one to expect success. The same findings proved true
when the second sample of 1000 cases was analyzed, and also were
obtained when the two sections comprising the 45-item test were
analyzed separately.


Rasch Model Analysis

Computations using the same data set made by a Rasch model item
analysis are as follows. For the complete set of 45 items that make
up the two tests, the range of item difficulty is 18 wits (or about 4
logits). This is a fairly typical value for a classroom achievement
test (which of course this was not!). The test was constructed to
meet the needs of an international project and was designed to be
effective in a broad spectrum of some 20 countries. As a result it
appears not to be matched exactly to this particular sample of
students in the USA. Although the easiest item in the test would have
been "difficult" for fewer than one percent of the sample, the most
difficult item (number 31) would have appeared quite easy to about 25
percent. For this particular group of students, the test could
theoretically have been improved by the inclusion of one or two more
difficult items.
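What a difficulty range of about 4 logits means for response probabilities can be sketched with the Rasch response function. The conversion constant below is an assumption taken only from the rough equivalence quoted in the text (18 wits to about 4 logits), not an exact definition of the wit, and the function names are illustrative.

```python
import math

# Rough ratio implied by "18 wits (or about 4 logits)"; an assumption,
# not an exact definition of the unit.
WITS_PER_LOGIT = 18 / 4


def wits_to_logits(wits):
    """Convert a difference in wits to an approximate difference in logits."""
    return wits / WITS_PER_LOGIT


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model (in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


if __name__ == "__main__":
    span = wits_to_logits(18)
    # A hypothetical student located at the middle of the difficulty range.
    easiest, hardest = -span / 2, span / 2
    print(round(rasch_probability(0.0, easiest), 3))
    print(round(rasch_probability(0.0, hardest), 3))
```

For a student at the midpoint of the range, the model gives a high probability of success on the easiest item and a correspondingly low one on the hardest; the Rasch model makes no allowance for guessing, which matters for the misfit discussed in this section.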

In general the fit to the Rasch model was quite good. The worst 
fitting items were (in order of misfit) 16, 39, 43, 20, 31, 7, and 
44. These are all comparatively difficult items. The analysis was 
repeated with these items (and item 32) eliminated, and the overall fit
improved considerably. However, it should be stressed that only items
39 and 15 were sufficiently poor to be rejected by the usual Rasch
item analysis criteria for fit.

It would appear that the inclusion of more difficult items, as
suggested above, would likely not have improved the test
overall because of misfit due to guessing. The analysis emphasizes
the seriousness of guessing on a four-way multiple-choice test.

There was a clear tendency for item discrimination to be related
to item difficulty. The easiest items on the test discriminated well
and the harder items comparatively poorly. All the misfitting items
were among the poor discriminators. When the analysis was repeated
omitting the eight poorest fitting items, the trend linking
discrimination to difficulty remained. Even though the most difficult
items on this test are not really very difficult for most of the
sample of students, it would appear that guessing was very
widespread. This would account for the overall relationship between
difficulty and discrimination. An index of item discrimination
deduced from the measure of misfit to the Rasch model correlated
0.967 with Sato's Caution Index, suggesting that these two are
measuring essentially the same thing (fit to a Guttman model).

To check the stability of the estimation of item difficulty, the
analysis run on the first 1000 cases in the data set and reported
above was repeated on the second 1000. The results showed a high
degree of stability. The conventional p-values of the items on the
two separate samples of students correlated 0.982, while the delta
values resulting from the Rasch scaling analysis correlated 0.984.

Each of the two sections of the test was composed of four
clusters of items each relating to a short reading passage. These
clusters vary little among themselves in terms of item characteristics,
although it may be noted that the first passage in each section
(Tailor Birds and Insects) is easier than those that follow it, and
the final cluster in the second section (Musk Ox) is somewhat less
discriminating than the average.

A check was made to see if the items operated differently for
boys and girls. In general no major discrepancies were discovered,
although a few differences in individual item difficulty did reach
significance. For example, items 7, 12, 24, 27, and 35 were
relatively easier for the girls while items 15, 32, 33, and 44 were
significantly easier for the boys. When the clusters were examined
further, small but significant trends were noted. The passages about
"seals" and "the poet" were somewhat easier for the girls, while the
passage about "eskimos" slightly favored the boys.

Generalizability Analysis

Generalizability analyses were performed to assess the magnitude
of the sources of variation in the data set. The sources of variation
include sex, persons, sections (first vs. second), passages (coded E
in the tables), and items. The variation for persons is considered
here to be the universe score variance (true score variance). All of
the other sources of variation are considered error. For all of the
analyses except that which includes sex, five items were selected at
random from each passage to make a balanced design. For the analysis
of sex, an equal number of boys and girls was selected.

Four designs of the basic data set were analyzed:

(1) Persons x Sections x Passages(Sections) x
Items(Passages(Sections))

(2) Persons x Sections x Passages x Items(Passages)

(This design assumes that the same passages appeared in both
sections and is probably not defensible. It was included to help
disentangle the passage x section interaction in design (1).)

(3) Persons x Sections x Items(Sections)

(This design ignores passage as a source of variation.)

(4) Persons x Sections x Items

(This design assumes that each section has the same items
and is probably not defensible. It was included to help disentangle
the item x section interaction in the above design.)

An additional design was included to assess the effects of sex:

(5) Sex x Persons(Sex) x Sections x Passages(Sections) x
Items(Passages(Sections))

(This analysis is essentially the same as design (1) with the
additional stratification by sex.)

Table 1 gives the variance components for the five designs.
These variance components are estimates for one section, one passage,
and one item. The variance component for sections is zero, indicating
that students performed equally well on both sections of the test.
The persons x sections (PS) interaction is also low, indicating that
students are ranked equally on both sections of the test.

In two sections, passages and items have nontrivial variation,
even if low. Some passages are easier than other passages and some
items are easier than other items. The variance components relating
to items are the highest. Further, there is some tendency for items
to rank students differently. To the extent that the section x item
interaction can be interpreted, the position of item difficulties
within one section does not correspond to the other section. In other
words, while the early items in the first section may be the easiest
in that section, the early items in the second section may not be the
easiest items in that section.

The large residual component in all designs suggests that there
may be other sources of variation in test scores that have not been
accounted for in the above designs.

Table 2 gives the generalizability coefficients for a variety of
decision study designs. The coefficients were computed for absolute
decisions: taking into account the absolute level of performance as
well as relative rankings among students. All sources of variation
other than that for persons, therefore, contribute to error. These G
coefficients are considerably lower than those for relative decisions,
which include only the sources of variation interacting with persons
(e.g., PS, PE(S), etc.).
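The distinction between absolute and relative decisions can be made concrete with the variance components reported in Table 1 for design (1). The sketch below is an illustration, not the program used for the analysis; because the components in Table 1 are rounded to three decimals, the results only approximate the entries of Table 2 (for example, .34827 for one section, two passages, and two items per passage).

```python
# Variance components for design (1), P x S x E(S) x I(E(S)), from Table 1.
DESIGN_1 = {
    "P": 0.031, "S": 0.000, "E(S)": 0.006, "I(SE)": 0.022,
    "PS": 0.000, "PE(S)": 0.005, "PI(SE),e": 0.187,
}


def g_absolute(n_s, n_e, n_i, vc=DESIGN_1):
    """Coefficient for absolute decisions: every source other than
    persons contributes to error."""
    error = (vc["S"] / n_s
             + vc["E(S)"] / (n_s * n_e)
             + vc["I(SE)"] / (n_s * n_e * n_i)
             + vc["PS"] / n_s
             + vc["PE(S)"] / (n_s * n_e)
             + vc["PI(SE),e"] / (n_s * n_e * n_i))
    return vc["P"] / (vc["P"] + error)


def g_relative(n_s, n_e, n_i, vc=DESIGN_1):
    """Coefficient for relative decisions: only sources interacting
    with persons contribute to error."""
    error = (vc["PS"] / n_s
             + vc["PE(S)"] / (n_s * n_e)
             + vc["PI(SE),e"] / (n_s * n_e * n_i))
    return vc["P"] / (vc["P"] + error)


if __name__ == "__main__":
    # n_s sections, n_e passages per section, n_i items per passage.
    for n_e in (2, 3, 4):
        print(n_e, round(g_absolute(1, n_e, 2), 3), round(g_relative(1, n_e, 2), 3))
```

For every D-study design the relative coefficient is at least as large as the absolute one, since its error term omits the section, passage, and item main effects.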

The G coefficients for designs (1) and (2) are similar, as are
those for designs (3) and (4). Increasing the number of items within
each passage beyond 3 or 4 items has little impact on reliability,
particularly when there are several passages in a
section. Further, the total number of items seems to have the most
impact on reliability; it does not matter how they are distributed

Table 1

Variance Components from Generalizability Analyses

    P x S x E(S) x I(E(S))                P x S x E x I(E)

Source        sigma^2      %          Source        sigma^2      %

P              .031      12.4         P              .031      12.4
S              .000       0.0         S              .000       0.0
E(S)           .006       2.4         E              .005       2.0
I(SE)          .022       8.8         I(E)           .007       2.8
PS             .000       0.0         PS             .000       0.0
PE(S)          .005       2.0         PE             .000       0.0
                                      SE             .001        .4
                                      PI(E)          .004       1.6
                                      SI(E)          .015       6.0
                                      PSE            .005       2.0
PI(SE),e       .187      74.5         PSI(E),e       .18?      72.8


    X x P(X) x S x E(S) x I(E(S))

Source        sigma^2      %          Source        sigma^2      %

X              .000       0.0         S              .000       0.0
P(X)           .031      12.4         E(S)           .007       2.8
XS             .000       0.0         I(SE)          .022       8.8
PS(X)          .000       0.0         XE(S)          .000       0.0
PE(XS)         .005       2.0         XI(SE)         .000       0.0
PI(XSE),e      .186      74.4


    P x S x I(S)                          P x S x I

Source        sigma^2      %          Source        sigma^2      %

P              .031      12.4         P              .031      12.4
S              .000       0.0         S              .000       0.0
I(S)           .027      10.8         I              .011       4.4
PS             .001       0.4         PS             .001       0.4
                                      PI             .005       2.0
                                      SI             .016       6.4
PI(S)          .191      76.4         PSI            .186      74.4


P = Persons
X = Sex
S = Section (First vs. Second)
E = Passage
I = Item
Table 2

Generalizability Coefficients for Absolute Decisions

P x S x E(S) x I(E(S))

                 No. of Sections = 1            No. of Sections = 2
                 # of Passages                  # of Passages
# of Items       2        3        4            2        3        4
    2          .34827   .44488   .51653       .51661   .61580   .68120
    3          .43304   .53389   .60425       .60437   .69613   .75331
    4          .49306   .59324   .66032       .66046   .74469   .79541
    5          .53777   .63563   .69926       .69941   .77723   .82301

P x S x E x I(E)

                 No. of Sections = 1            No. of Sections = 2
                 # of Passages                  # of Passages
# of Items       2        3        4            2        3        4
    2          .34496   .44056   .51142       .49016   .58983   .65659
    3          .42885   .52860   .59816       .57439   .66848   .72811
    4          .44821   .58728   .65359       .62839   .66848   .72811
    5          .53243   .62918   .69206       .66595   .74829   .79761

P x S x I(S)

                 No. of Sections = 1            No. of Sections = 2
                 # of Passages                  # of Passages
# of Items       2        3        4            2        3        4
    2          .35798   .45310   .52251       .52723   .62363   .68638
    3          .45310   .55063   .61704       .62363   .71020   .76317
    4          .52251   .61704   .67841       .68638   .76317   .80839
    5          .57540   .66518   .72146       .73048   .79892   .83819

P x S x I

                 No. of Sections = 1            No. of Sections = 2
                 # of Passages                  # of Passages
# of Items       2        3        4            2        3        4
    2          .35393   .44750   .51567       .50586   .60239   .66592
    3          .44750   .54325   .60833       .60239   .69019   .74444
    4          .51567   .60833   .66838       .66599   .74444   .79107
    5          .56754   .65544   .71046       .71091   .78128   .82197

across passages. For example, four passages with two items each have
about the same reliability as two passages with four items each. The
same result holds for sections; it does not matter how items are
distributed across sections. For example, in design (1), one section
with four passages with two items each has a G coefficient of .52; one
section with two passages with four items each has a G coefficient of
.52. All of the above combinations have eight items total. Similar
combinations with a total of 16 items have G coefficients ranging from
.66 to .68.
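
The decision-study arithmetic behind these comparisons can be sketched as follows for the P x S x E(S) x I(E(S)) design, using the rounded components from Table 1. The function name and the way the error terms are grouped are illustrative assumptions; the formula is the usual absolute-error expression for a fully nested design, not a reproduction of the program used for the report.

```python
# Rounded variance components for the P x S x E(S) x I(E(S)) design (Table 1).
COMPONENTS = {
    "P": 0.031, "S": 0.000, "E(S)": 0.006, "I(SE)": 0.022,
    "PS": 0.000, "PE(S)": 0.005, "PI(SE),e": 0.187,
}


def absolute_g(n_s, n_e, n_i, var=COMPONENTS):
    """Absolute-decision G coefficient for n_s sections, n_e passages
    per section, and n_i items per passage."""
    abs_error = (
        (var["S"] + var["PS"]) / n_s
        + (var["E(S)"] + var["PE(S)"]) / (n_s * n_e)
        + (var["I(SE)"] + var["PI(SE),e"]) / (n_s * n_e * n_i)
    )
    return var["P"] / (var["P"] + abs_error)


one_section = absolute_g(n_s=1, n_e=4, n_i=2)    # 8 items in one section
two_sections = absolute_g(n_s=2, n_e=2, n_i=2)   # 8 items over two sections
sixteen_items = absolute_g(n_s=1, n_e=4, n_i=4)  # 16 items in one section
print(round(one_section, 2), round(two_sections, 2), round(sixteen_items, 2))
```

Because the section (S) and person-by-section (PS) components are estimated as zero, spreading eight items over two sections gives the same coefficient (about .52) as placing them in one section, and sixteen items give about .66. Agreement with Table 2 is only to about two decimals because the components are rounded.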

The final analysis examined sex as a source of variation. The
component for sex was zero, indicating that boys and girls showed
equal mean performance. Furthermore, the inclusion of sex did not
affect any other component. In other words, items, passages and
sections ranked boys and girls similarly. This finding seems to
conflict somewhat with the finding in the Rasch analysis that some
items ranked boys and girls differently.

Three-Parameter Latent Trait Analysis 

With the introduction of an improved version of the LOGIST
computer program for estimating the parameters in latent trait models,
its use for examining test behavior is likely to become more
widespread. However, a problem remains in the evaluation of the
results, as the parameters derived by the program are likely to be
unstable. The problem is to identify the sources of instability and
to assess their relative effects on the parameter estimates. The
three sources of instability are:

1) Non-unidimensionality of the item responses, 

2) Mis-specification of the item response model, and 

3) Inadequacies of the estimation procedures. 

Of these three sources, non-unidimensionality has the most serious
impact for test users. Under this circumstance, items cannot be
characterized as having uniquely identified parameters, and examinee
abilities estimated from any derived item parameters are left
undefined as well. As an end result, one might be in no better
position than if the original raw number-correct scores were used. In
fact, one's position could be worse if the test user were to act
as if the ability estimates were item-free and sample-free.

If the sources of instability are due to model mis-specification
or estimation inadequacies, and not due to non-unidimensionality,
one can speak of true values for both item and ability parameters
which are only being inaccurately estimated. In this case, increased
stability may be obtained through relatively straightforward fixes,
such as going from a one-parameter model to a three-parameter model, or
increasing sample sizes. However, more complicated solutions may be
needed, such as the development of a new model with different types of
parameters.

Without the presence of external criteria it is difficult to
separate out the various sources of instability; however, it is
possible to gather circumstantial evidence that may enable one to
deduce their relative effects. Under ideal circumstances, both item
and examinee parameters should be estimable and stable regardless of
the items and the examinees used in the estimation procedure.
Therefore, one would expect that item parameters estimated from two
separate runs on independent samples of examinees should correlate
very highly with one another. Likewise, examinee abilities estimated
from independent subsets of items but calibrated to the same latent
trait scale should also correlate very highly with one another. If
these high correlations are maintained across nonrandom samples of
items and examinees, one can place considerably more confidence in the
parameter estimates.

With the Reading Comprehension Test data, the stability of item
parameter estimates was investigated across independent random samples
using different sample sizes and item sets. Table 3 contains the
correlations for each of the three item parameters using different
sample sizes. The correlations are between the item parameter
estimates as they were derived from separate random samples of
examinees. Thus for the 45-item Reading Comprehension Test, the
LOGIST program produced 45 difficulty parameters for a sample of 1,000
examinees. Another LOGIST run was made with another sample of 1,000
examinees, and again it produced 45 difficulty parameters. The
correlation between these two sets of difficulty parameters appears in
Table 3 in the row labeled b. Similarly, correlations were produced
for the discrimination and guessing parameters a and c.
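
The logic of this stability check can be sketched with simulated data. The example below is only an illustration under stated assumptions: it does not reproduce LOGIST, the generating item parameters and the scaling constant D = 1.7 are hypothetical, and each item's proportion correct is used as a crude stand-in for a calibrated difficulty estimate. Responses are generated from the three-parameter logistic model for two independent samples of 1,000 simulated examinees on 45 items, and the two sets of item statistics are then correlated.

```python
import math
import random


def p_correct(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)


def proportion_correct(items, n_examinees, rng):
    """Simulate one sample and return each item's proportion correct,
    used here only as a stand-in for a calibrated difficulty."""
    counts = [0] * len(items)
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)  # simulated examinee ability
        for j, (a, b, c) in enumerate(items):
            if rng.random() < p_correct(theta, a, b, c):
                counts[j] += 1
    return [k / n_examinees for k in counts]


rng = random.Random(0)
# Hypothetical generating parameters (a, b, c) for 45 items.
items = [(rng.uniform(0.5, 1.5), rng.uniform(-2.0, 2.0), 0.2) for _ in range(45)]
sample_1 = proportion_correct(items, 1000, rng)
sample_2 = proportion_correct(items, 1000, rng)
stability = pearson(sample_1, sample_2)
print(round(stability, 3))
```

In a simulation of this kind the data are unidimensional by construction, so a high correlation between the two samples is expected; with real data, the same comparison would be made on the a, b, and c estimates produced by the calibration program.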

Table 3

Stability Correlation of Item Parameter
Based on Sample Sizes of 1,000 and 500

         N = 1,000      N = 500
a           .72           .70
b           .97           .95
c           .64           .35


Table 4

Stability Correlation of Item Parameter for
Odd and Even Item Sets Based on Sample Sizes of 1,000

         Odd Items (N = 23)      Even Items (N = 22)
a              .62                     .68
b              .97                     .96
c              .35                     .82


Table 5

Stability Correlation of Item Parameter for
Guessable and Non-Guessable Item Sets Based on
Sample Sizes of 1,000

         Guessable (N = 14)      Non-Guessable (N = 24)
a              .93                     .38
b              .97                     .91
c              .82                     .25

The difficulty parameter has the highest correlation (.9699),
discrimination is next highest (.7225), and the guessing parameter is
lowest (.5448). In order to investigate the effect of sample size on
the stability of estimates, similar correlations were produced with
sample sizes of 500. Both the b and a parameters maintained the same
magnitudes (.9546 and .7027, respectively), but the correlation for the
guessing parameter drops considerably (to .3502). This suggests the
importance of sample size in the estimation of the c parameter;
however, the discrimination parameter correlations also indicate room
for improvement.

Besides the effect of examinee sample sizes, the number of items
being estimated may also have an effect on the stability of the
estimation procedures. Because LOGIST utilizes maximum likelihood
estimation procedures, the estimates are likely to be biased, especially
when the total number of examinee-by-item observations is limited
(Andersen, 1973). Table 4 illustrates the effect of reducing the
number of items by half. Using sample sizes of 1,000, the
correlations were calculated for odd items and again for even items.
The stability of the difficulty parameters remains high (.97 and .96
for the odd and even item sets, respectively), but the stability of the
discrimination parameters drops. Surprisingly, however, the c
parameter stability goes up considerably for the even items but falls
for the odd items. This appears to suggest that the stability of the
item parameters, independent of sample sizes, has a lot to do with the
types of items included in the analysis. In other words, the
unidimensionality of the items in the Reading Comprehension Test is
questionable.

Pursuing this line of reasoning, it was felt that the 45 items
could be classified in some way to produce more homogeneous item
sets. Because the influence of guessing has received quite a lot of
attention in the application of the three-parameter model, one method
of classifying the items is on the basis of their guessability, that
is, the likelihood of getting an item correct without possessing the
requisite knowledge. In order to classify the items as guessable, the
45 reading items without their corresponding reading passages were
presented to eight adult college-educated subjects. Guessable items
were judged to be those which seven of the eight subjects were
able to answer correctly without having read the passages, while
non-guessable items were those which two or fewer subjects were able
to get correct.
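
The classification rule just described can be written out as a short sketch. The function name, the label strings, and the example tallies are hypothetical, and reading "seven of the eight" as "seven or eight" is an assumption made for illustration.

```python
def classify_guessability(n_correct, n_judges=8):
    """Classify an item from the number of judges who answered it
    correctly without seeing the passage."""
    if not 0 <= n_correct <= n_judges:
        raise ValueError("n_correct must lie between 0 and n_judges")
    if n_correct >= 7:       # assumption: "seven of the eight" read as 7 or 8
        return "guessable"
    if n_correct <= 2:       # two or fewer judges correct
        return "non-guessable"
    return "unclassified"    # 3 to 6 correct: left out of both sets


# Hypothetical tallies for five items (not data from the study).
tallies = [8, 7, 5, 2, 0]
labels = [classify_guessability(k) for k in tallies]
print(labels)
```

Under a rule of this form, items falling between the two cut-offs belong to neither set, so the two item sets need not add up to the full 45 items.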

In all, 14 items were classified as guessable, and 24 were
classified as non-guessable. The resulting item correlations from the
IEA examinees are based on sample sizes of 1,000 and are presented in
Table 5. The stability of parameter estimates goes up for all three
parameters for the guessable items and goes down for the non-guessable
items. The stability correlation for the discrimination parameter
goes up considerably for the guessable items (to a respectable .93),
and the correlation for the c parameter also goes up (to .82). For
the non-guessable items, the a and c parameters go down (to .38 and
.25, respectively), which seems to indicate that the non-guessable items
are non-unidimensional and that the non-unidimensionality is
responsible for most of the instability of the item estimates.

The strategy used in the preceding three-parameter analysis was
principally one of deduction from available correlational evidence
without the use of external validating criteria. The general
conclusion for the Reading Comprehension Test data is that the 45
items are not unidimensional and that such non-unidimensionality
considerably affects the stability of LOGIST estimates. It should be
noted, in particular, that this non-unidimensionality would not have
been detected through the estimation of difficulty parameters alone as
would be produced by the Rasch analysis.

The results of the three-parameter study also seemed to provide
some evidence for the nature of the reading test behavior of the set
of examinees. It seems that much of what is called reading ability
depends on what the student brings to the reading situation, i.e., his
or her own experiences with and exposure to particular topics. This
may underlie the higher stability of the parameter estimates for the
guessable items as contrasted with the non-guessable items. The
non-unidimensionality of the latter should not be too surprising since
examinees, presumably, must read the passages before they select an
answer, and their subsequent ability to respond correctly to the item
is probably a function of several reading comprehension and
test-taking strategies.

SUMMARY PAPER
J. Ward Keesling 

I. Introduction: What should a measurement model provide? 

A. An assessment of the fit of the model 

B. Parameter estimates that capture information of importance 
about the elements of the model (e.g., person and item 
characteristics) 

1. The estimated parameters for persons are the "measurements"
in the model

2. The estimated parameters characterizing items should 
provide insight about the items (e.g., their difficulty 
levels) and permit more sophisticated construction and 
interpretation of tests 

3. The special case of the multiple-choice item. The need 
for parameters to characterize distractors. 

C. Estimates of the precision of the parameter estimates, to help
us understand the latter statistics.

D. Overview of the chapter 

II. Evaluation of the models given the above criteria 

A. Logistic models 

B. S-P model 

C. G-Theory (is this really a measurement model?) 

D. AUC models 

III. An examination of the salience of the models to three types
of use

A. Assessing pupil progress in a classroom 

B. The norm-referenced evaluation 

C. The domain-referenced evaluation 

(For each, discuss the utility of the information in
the various models, vs. the cost of obtaining it.
Attend especially to the potential of item banks.)

IV. Implications of microcomputer technology

(Review III, with a view to how technology could help/hinder) 

V. Summary 